Senior Site Reliability Engineer

Varda Space Industries
El Segundo, CA
Onsite

About The Position

At Varda Space Industries, we’re pushing the boundaries of what’s possible in space and materials science, and we’re looking for bold engineers to help us get there. As a Senior Site Reliability Engineer, you’ll play a critical role in building, scaling, and maintaining the infrastructure that powers our systems on Earth, in orbit, and everywhere in between. We are looking for an experienced engineer with deep working knowledge of Kubernetes and container technologies. You are a hands-on operator and builder who applies first-principles thinking to both software delivery (DevOps) and production reliability (SRE), and who thrives in complex, mission-critical environments.

Requirements

  • Bachelor’s degree in computer science, engineering, or related STEM field with 5+ years of Site Reliability Engineering experience, or 7+ years of progressive experience in DevOps, SRE, or Systems Engineering in lieu of a degree.
  • Experience with Infrastructure as Code (IaC) using tools like Terraform to automate server provisioning and configuration management.
  • Experience operating Kubernetes or similar container orchestration platforms in production environments.
  • Experience with Prometheus, Grafana, InfluxDB, or similar technologies.
  • Knowledge of software-defined networking (VPCs, subnets, firewalls, VPNs, etc.).
  • Scripting experience in Python, Bash, PowerShell, or a similar language.
  • Strong, positive communication skills, both written and oral.

Nice To Haves

  • Experience in provisioning and managing scalable Azure cloud infrastructure using native tools and best practices
  • Experience implementing configuration management, provisioning, and workflow automation solutions via Infrastructure as Code, CI/CD, and GitOps (e.g., Ansible, Salt, ArgoCD).
  • Strong understanding of Linux systems and container runtimes (e.g., containerd, Docker)
  • Experience with GPU workloads or high-throughput computing.
  • Hands-on experience operating and optimizing High Performance Computing (HPC) environments, including workload schedulers such as Slurm (e.g., queue/partition design, fair-share scheduling, and cluster resource management).
  • Experience with hybrid environments (cloud + on-prem or edge systems)
  • Experience debugging distributed systems at scale (network, storage, latency)
  • Experience with databases and data modeling

Responsibilities

  • Deploy, maintain, and operate mission-critical applications and infrastructure supporting spacecraft and company-wide systems.
  • Build and evolve Infrastructure as Code (IaC) frameworks using tools such as Terraform
  • Implement and operate observability systems (metrics, logging, tracing) and actionable alerting.
  • Build and maintain CI/CD pipelines to enable safe, repeatable, and rapid deployments.
  • Partner with software and hardware engineers to deliver highly operable, reliable, and scalable systems and pipelines, ensuring they have the tools and infrastructure needed for rapid iteration.
  • Identify, analyze, and resolve system bottlenecks and reliability risks; perform performance tuning and implement long-term stability improvements.
  • Respond to and resolve production incidents; perform root cause analysis and drive corrective actions through blameless postmortems.
  • Rotate through the team’s on-call schedule to keep critical systems healthy and responsive.
  • Work extended hours and weekends as needed.
  • Occasionally travel to customer sites and other Varda locations to troubleshoot, deploy, or test critical infrastructure.

Benefits

  • Equity in a fully funded space startup with potential for significant growth (interns excluded)
  • 401(k) matching (interns excluded)
  • Unlimited PTO (interns excluded)
  • Health insurance, including Vision and Dental
  • Lunch and snacks provided on site every day. Dinners provided twice a week.
  • Maternity / Paternity leave (interns excluded)