Senior Production Engineer

Anduril Industries•Columbia, CA

Onsite

About The Position

The SRE team owns reliability and infrastructure for Anduril's cloud deployments. We operate Kubernetes clusters, Terraform infrastructure, and observability platforms across 10+ production environments supporting active defense contracts. When platform services break under real operational load, we're the team that fixes them, often at the code level, not just the config level.

We are looking for a Senior Production Engineer to join our team in Costa Mesa, CA (or DC). In this role, you will be responsible for diagnosing and fixing stability vulnerabilities in core platform services that cause cascading failures in multi-tenant cloud deployments. You will write production Go to implement resilience patterns (leader election, circuit breakers, failure domain isolation) directly in service code. This will require deep experience with distributed systems, debugging complex failure modes across service boundaries, and writing production-quality Go. If you are someone who thrives on fixing hard reliability problems in live systems rather than building greenfield, this role is for you.

Requirements

Production-quality Go — you'll be modifying core platform services, not writing scripts
Practical experience with distributed systems: leader election, consensus, replication, failure modes
Kubernetes — enough to understand how services run (not necessarily cluster administration)
Debugging complex systems — tracing cascading failures across service boundaries
4+ years in SRE, platform engineering, or backend development roles
Must be a U.S. Person due to required access to U.S. export controlled information or facilities

Nice To Haves

Rust (some platform services use it)
Experience fixing reliability problems in production services (not just building greenfield)
Familiarity with gRPC service architectures
HashiCorp Consul or similar service discovery/mesh
FedRAMP/IL5 compliance environment experience
ArgoCD / GitOps workflows

Responsibilities

Diagnose and fix stability vulnerabilities in core platform services that cause cascading failures under multi-replica, multi-tenant operation
Implement resilience patterns (leader election, circuit breakers, failure domain isolation) directly in service code
Design multi-replica support for services that currently assume single-instance operation
Collaborate with service owners on contract testing and upgrade validation
Trace cascading failures across service boundaries and drive them to root-cause fixes
Contribute to observability platform improvements to support service stability
Light infrastructure work: Terraform/Kubernetes changes to support service fixes (~20% of time)

Benefits

Highly competitive equity grants are included in the majority of full-time offers and are considered part of Anduril's total compensation package.
Top-tier benefits for full-time employees, including a comprehensive, competitive benefits package (available at little to no cost to employees) that supports you in health, recovery, and whatever comes next.
