Senior Site Reliability Engineer

Varda Space Industries • El Segundo, CA

52d • Onsite

About The Position

At Varda Space Industries, we're pushing the boundaries of what's possible in space and materials science, and we're looking for bold engineers to help us get there. As a Senior Site Reliability Engineer, you'll play a critical role in building, scaling, and maintaining the infrastructure that powers our systems on Earth, in orbit, and everything in between. We are looking for an experienced engineer with deep working knowledge of Kubernetes and containerized technologies. You are a hands-on operator and builder who applies first-principles thinking to both software delivery (DevOps) and production reliability (SRE), and thrives in complex, mission-critical environments.

Requirements

Bachelor’s degree in computer science, engineering, or related STEM field with 5+ years of Site Reliability Engineering experience, or 7+ years of progressive experience in DevOps, SRE, or Systems Engineering in lieu of a degree.
Experience with Infrastructure as Code (IaC) using tools like Terraform to automate server provisioning and configuration management
Experience operating Kubernetes or similar container orchestration platforms in production environments.
Experience with Prometheus, Grafana, InfluxDB, or similar technologies.
Knowledge of software-defined networking (VPC, Subnets, Firewalls, VPNs, etc.)
Python, Bash, PowerShell (or similar) scripting experience
Strong, positive communication skills, both written and oral

Nice To Haves

Experience in provisioning and managing scalable Azure cloud infrastructure using native tools and best practices
Experience implementing configuration management, provisioning, and workflow automation solutions via Infrastructure as Code, CI/CD, and GitOps (e.g., Ansible, Salt, ArgoCD).
Strong understanding of Linux systems and container runtimes (e.g., containerd, Docker)
Experience with GPU workloads or high-throughput computing.
Hands-on experience operating and optimizing High Performance Computing (HPC) environments, including workload schedulers such as Slurm (e.g., queue/partition design, fair-share scheduling, and cluster resource management).
Experience with hybrid environments (cloud + on-prem or edge systems)
Experience debugging distributed systems at scale (network, storage, latency)
Experience with databases and data modeling

Responsibilities

Deploy, maintain, and operate mission-critical applications and infrastructure supporting spacecraft and company-wide systems.
Build and evolve Infrastructure as Code (IaC) frameworks using tools such as Terraform
Implement and operate observability systems (metrics, logging, tracing) and actionable alerting.
Build and maintain CI/CD pipelines to enable safe, repeatable, and rapid deployments.
Partner with software and hardware engineers to deliver highly operable, reliable, and scalable systems and pipelines, ensuring they have the tools and infrastructure needed for rapid iteration.
Identify, analyze, and resolve system bottlenecks and reliability risks; perform performance tuning and implement long-term stability improvements.
Respond to and resolve production incidents; perform root cause analysis and drive corrective actions through blameless postmortems.
Rotate through the team’s on-call schedule to keep critical systems healthy and responsive.
Work extended hours and weekends as needed.
Occasionally travel to customer sites and other Varda locations to troubleshoot, deploy, or test critical infrastructure.

Benefits

Equity in a fully funded space startup with potential for significant growth (interns excluded)
401(k) matching (interns excluded)
Unlimited PTO (interns excluded)
Health insurance, including Vision and Dental
Lunch and snacks provided on site every day. Dinners provided twice a week.
Maternity / Paternity leave (interns excluded)
