Principal Site Reliability Engineer

Varda • El Segundo, CA

81d • $153,000 - $185,000

About The Position

As a Principal Site Reliability Engineer, you will help set the technical vision and strategy for reliability across spacecraft, ground systems, and enterprise platforms. You'll define standards, mentor senior engineers, and drive cross-organizational initiatives to ensure systems are highly operable, secure, and mission-ready. This role combines deep technical expertise with the ability to influence architectural direction at the company level.

Requirements

10+ years of experience in SRE, DevOps, or systems engineering, including leadership of large-scale, mission-critical systems.
Experience leading technical direction and architecture for large-scale systems.
Hands-on experience with observability stacks and telemetry pipelines (metrics collection, alerting, and dashboards) for Linux systems and Kubernetes workloads, e.g., Prometheus and Grafana.
Strong background in systems architecture and software-defined networking (VPC, subnets, firewalls, VPNs, etc.).
Proficiency in automation and scripting with Python, Bash, or similar languages.
Strong, positive communication skills, both written and oral.

Nice To Haves

Expertise in time-series databases (e.g., InfluxDB) for large-scale telemetry pipelines.
Expertise in provisioning and managing scalable Azure cloud infrastructure using native tools and best practices (Azure GCC High preferred).
Experience with IaC tools such as Terraform and Ansible, and with CI/CD systems such as Git and ArgoCD.
Experience building and maintaining dynamic system configurations with templating frameworks such as YAML and Helm.
Strong understanding of Linux systems, containerization technologies, and Kubernetes internals.

Responsibilities

Lead and contribute hands-on to the deployment, maintenance, and operations of mission-critical applications and infrastructure supporting spacecraft, ground systems, and company-wide platforms.
Design, execute, and manage highly scalable, reliable, and operable software and infrastructure platforms, applying Infrastructure as Code (IaC) principles to drive automation, consistency, and repeatability across Kubernetes environments.
Collaborate closely with software and hardware teams to align reliability best practices, CI/CD pipelines, and compliance with their workflows, enabling faster, more secure deployments for mission-critical systems.
Anticipate and address reliability risks, capacity challenges, and performance bottlenecks; develop long-term strategies in partnership with leadership.
Rotate through the team's on-call schedule to keep critical systems healthy and responsive.
Occasionally travel to customer sites and other Varda locations to troubleshoot, deploy, or test critical infrastructure.

Benefits

Exciting team of professionals at the top of their field working by your side.
Equity in a fully funded space startup with potential for significant growth (interns excluded).
401(k) matching (interns excluded).
Unlimited PTO (interns excluded).
Health insurance, including Vision and Dental.
Lunch and snacks provided on site every day. Dinners provided twice a week.
Maternity / Paternity leave (interns excluded).

What This Job Offers

Job Type: Full-time
Career Level: Senior
Industry: Professional, Scientific, and Technical Services
Number of Employees: 51-100 employees
