Software Engineer, Compute Infrastructure

OpenAI · San Francisco, CA
Onsite

About The Position

We are looking for engineers to help build and operate the next generation of compute infrastructure powering OpenAI’s frontier research. This is an opportunity to work on the large-scale clusters, high-performance networks, and supercomputing systems that enable some of the most advanced AI workloads in the world.

In this role, you’ll combine distributed systems engineering with hands-on infrastructure work across some of our largest data centers. You’ll help scale Kubernetes clusters to massive scale, automate bare-metal bring-up, and build the software layers that make heterogeneous GPU fleets and multi-datacenter supercomputing environments easier to operate.

You’ll work where hardware and software meet, in an environment where speed, efficiency, and reliability are critical. That means solving real-time operational challenges, quickly diagnosing and fixing issues when they arise, and continuously improving automation, resilience, performance, and uptime across the systems that power frontier model training.

Requirements

  • Experience as an infrastructure, systems, or distributed systems engineer in large-scale or high-availability environments
  • Strong knowledge of Kubernetes internals, cluster scaling patterns, and containerized workloads
  • Proficiency in compute infrastructure concepts (compute, networking, storage, security) and in automating cluster or data center operations

Nice To Haves

  • Experience operating large-scale compute fleets and bringing diverse hardware across providers, generations, and environments into one reliable platform
  • A deep concern for infrastructure efficiency and know-how to maximize utilization so every GPU and CPU delivers meaningful work
  • A strong bias for operational excellence, balancing speed with long-term quality and building systems that improve consistently over time
  • A focus on solving root causes rather than symptoms, building trust by eliminating recurring pain points for users
  • Experience improving training performance, reducing bottlenecks, and helping workloads run faster and more cost-effectively at scale
  • Enthusiasm for pushing the limits of scale, from increasing concurrent workloads to enabling larger and more ambitious single-cluster jobs
  • A track record of building intuitive platforms and tooling that let researchers, product teams, and operators self-serve with minimal manual support
  • Comfort working in fast-moving environments where ownership, reliability, and continuous improvement are essential
  • Background with GPU workloads, firmware management, or high-performance computing

Responsibilities

  • Spin up and scale large Kubernetes clusters, including automation for provisioning, bootstrapping, and cluster lifecycle management
  • Build software abstractions that unify multiple clusters and present a seamless interface to training workloads
  • Own node bring-up from bare metal through firmware upgrades, ensuring fast, repeatable deployment at massive scale
  • Improve operational metrics such as reducing cluster restart times (e.g., from hours to minutes) and accelerating firmware or OS upgrade cycles
  • Integrate networking and hardware health systems to deliver end-to-end reliability across servers, switches, and data center infrastructure
  • Develop monitoring and observability systems to detect issues early and keep clusters stable under extreme load


What This Job Offers

  • Job Type: Full-time
  • Career Level: Senior
  • Education Level: No education listed
  • Number of Employees: 1-10 employees
