Full-Stack Software Engineer, Compute Foundations

OpenAI • San Francisco, CA

16d • $230,000 - $347,000

About The Position

The Frontier Clusters team at OpenAI builds, launches, and supports the largest supercomputers in the world, with the mission of making them work for frontier model training. This involves bringing new clusters online, scaling them for larger training runs, improving cluster availability, and collaborating with researchers to address issues that hinder reliable job execution. The role requires debugging across hardware and software, developing operational tools, and working closely with researchers to ensure the reliability of frontier training. The successful candidate will build web-based tools that answer critical questions about supercomputing clusters: why a job is down, whether nodes are ready, how much capacity has been lost, and where agent-based solutions might help. This is an opportunity to work at the forefront of AI infrastructure, turning complex operational challenges into clear, reliable, and scalable systems.

Requirements

Significant experience in full-stack development, with expertise in modern frontend frameworks such as React, Vue, or Angular, and backend technologies such as Python, Go, or Node.js.
Experience building scalable, high-performance web applications for complex distributed systems.
Solid understanding of APIs, distributed data systems, and cloud infrastructure.
Execution-focused, with a strong eye for usability, performance, and scalability.
Comfortable working in fast-paced, highly collaborative environments with tight timelines and evolving priorities.

Nice To Haves

Experience working with Kubernetes, Docker, and cloud-native application deployment.
Understanding of AI/ML workload scheduling and orchestration challenges.
Experience with real-time data processing, visualization libraries, and observability tooling.

Responsibilities

Build full-stack web applications that help researchers and operators understand cluster health, job failures, and usable capacity in real time.
Turn questions like “why is this job down?” and “what is preventing these nodes from being ready?” into clear, actionable product experiences.
Collaborate closely with researchers and infrastructure teams to identify high-leverage operational problems and build tools that solve them.
Design data models, APIs, and visualizations that make large-scale scheduling, resource allocation, and cluster behavior easier to inspect and reason about.
Develop scalable backend services that process high-volume cluster and workload data with low latency and high reliability.
Build intuitive frontend workflows that connect users to the underlying scheduling, storage, and compute systems.
Raise the bar for reliability, performance, and security in the systems used to operate OpenAI’s frontier compute.
