Software Engineer - Reliability

Luma AI, Inc. • Palo Alto, CA

About The Position

Luma's mission is to build multimodal AI to expand human imagination and capabilities. This requires a massive, reliable, and performant GPU infrastructure that pushes the boundaries of scale. Our SRE team is the foundation of our research and product velocity, responsible for the thousands of NVIDIA and AMD GPUs across multiple providers that power our work. This is not a typical cloud SRE role. We are looking for a hands-on, first-principles engineer who is fluent in Linux and comfortable operating close to the metal. You will build, maintain, and scale Luma's large-scale GPU infrastructure, working directly on on-prem and multi-vendor cloud clusters. You'll solve complex systems problems, ensure reliability through clear SLOs/SLIs, and build automation that allows us to operate at an unprecedented scale with a lean team.

Requirements

5+ years of experience as an SRE, production engineer, or infrastructure engineer in a fast-paced, large-scale environment.
Deep, hands-on expertise in Linux and containerized systems.
Strong experience with Kubernetes in production environments at meaningful scale.
Proficient in Python and/or Go, with a track record of building infrastructure tooling.
Strong understanding of networking, cloud infrastructure (AWS/GCP), and IaC tools like Terraform.
A tenacious troubleshooter who thrives on solving complex, low-level problems.
Experience managing large-scale GPU clusters for AI/ML workloads (training or inference).

Nice To Haves

Familiarity with job management systems based on Kubernetes or orchestration frameworks like Ray.
Experience debugging GPU performance issues with specialized tools.

Responsibilities

Own GPU Cluster Reliability: Take end-to-end ownership of our GPU clusters for training and inference, ensuring high availability and peak performance across multiple cloud providers.
Drive Reliability Metrics: Define and maintain service-level objectives (SLOs) and indicators (SLIs) to measure and improve reliability as our infrastructure scales.
Deep Linux Expertise: Use your mastery of Linux systems to troubleshoot and optimize performance at the OS level.
Build Robust Automation: Write high-quality tools and automation in Python, Go, or Bash to manage, monitor, and heal our infrastructure.
Master Kubernetes at Scale: Operate and scale Kubernetes clusters beyond managed services, ensuring reliability across diverse workloads.
Modern Operations Practices: Implement and manage observability stacks (Prometheus, Grafana) and GitOps workflows (Argo CD, Flux) to keep infrastructure transparent and resilient.
