Software Engineer, Compute Infrastructure

OpenAI · San Francisco, CA
Onsite

About The Position

We are looking for engineers to help build and operate the next generation of compute infrastructure powering OpenAI’s frontier research. This is an opportunity to work on the large-scale clusters, high-performance networks, and supercomputing systems that enable some of the most advanced AI workloads in the world.

In this role, you’ll combine distributed systems engineering with hands-on infrastructure work across some of our largest data centers. You’ll help scale Kubernetes clusters to massive scale, automate bare-metal bring-up, and build the software layers that make heterogeneous GPU fleets and multi-datacenter supercomputing environments easier to operate.

You’ll work where hardware and software meet, in an environment where speed, efficiency, and reliability are critical. That means solving real-time operational challenges, quickly diagnosing and fixing issues when they arise, and continuously improving automation, resilience, performance, and uptime across the systems that power frontier model training.

Requirements

  • Experience as an infrastructure, systems, or distributed systems engineer in large-scale or high-availability environments
  • Strong knowledge of Kubernetes internals, cluster scaling patterns, and containerized workloads
  • Proficiency in compute infrastructure concepts (compute, networking, storage, security) and in automating cluster or data center operations

Nice To Haves

  • Experience operating large-scale compute fleets and bringing diverse hardware across providers, generations, and environments into one reliable platform
  • A deep concern for infrastructure efficiency and know-how to maximize utilization so every GPU and CPU delivers meaningful work
  • A strong bias for operational excellence, balancing speed with long-term quality and building systems that improve consistently over time
  • A focus on solving root causes rather than symptoms, building trust by eliminating recurring pain points for users
  • Experience improving training performance, reducing bottlenecks, and helping workloads run faster and more cost-effectively at scale
  • Enthusiasm for pushing the limits of scale, from increasing concurrent workloads to enabling larger and more ambitious single-cluster jobs
  • A track record of building intuitive platforms and tooling that let researchers, product teams, and operators self-serve with minimal manual support
  • Comfort working in fast-moving environments where ownership, reliability, and continuous improvement are essential
  • Background with GPU workloads, firmware management, or high-performance computing

Responsibilities

  • Spin up and scale large Kubernetes clusters, including automation for provisioning, bootstrapping, and cluster lifecycle management
  • Build software abstractions that unify multiple clusters and present a seamless interface to training workloads
  • Own node bring-up from bare metal through firmware upgrades, ensuring fast, repeatable deployment at massive scale
  • Improve operational metrics such as reducing cluster restart times (e.g., from hours to minutes) and accelerating firmware or OS upgrade cycles
  • Integrate networking and hardware health systems to deliver end-to-end reliability across servers, switches, and data center infrastructure
  • Develop monitoring and observability systems to detect issues early and keep clusters stable under extreme load


What This Job Offers

  • Job Type: Full-time
  • Career Level: Senior
  • Education Level: No education listed
  • Number of Employees: 1-10 employees
