The SRE team owns reliability and infrastructure for Anduril's cloud deployments. We operate Kubernetes clusters, Terraform infrastructure, and observability platforms across 10+ production environments supporting active defense contracts. When platform services break under real operational load, we're the team that fixes them, often at the code level, not just the config level.

We are looking for a Senior Production Engineer to join our team in Costa Mesa, CA (or DC). In this role, you will be responsible for diagnosing and fixing stability vulnerabilities in core platform services that cause cascading failures in multi-tenant cloud deployments. You will write production Go to implement resilience patterns (leader election, circuit breakers, failure domain isolation) directly in service code.

This will require deep experience with distributed systems, debugging complex failure modes across service boundaries, and writing production-quality Go. If you are someone who thrives on fixing hard reliability problems in live systems rather than building greenfield, this role is for you.
Job Type
Full-time
Career Level
Senior
Education Level
No Education Listed