The Frontier Clusters team at OpenAI builds, launches, and supports the largest supercomputers in the world, with the mission to make them work for frontier model training. This involves bringing new clusters online, scaling them for larger training runs, improving cluster availability, and collaborating with researchers to address issues that hinder reliable job execution. The role requires debugging across hardware and software, developing operational tools, and working closely with researchers to ensure the reliability of frontier training. The successful candidate will build web-based tools to answer critical questions about supercomputing clusters, such as job downtime, node readiness, capacity loss, and potential agent-based solutions. This is an opportunity to work at the forefront of AI infrastructure, transforming complex operational challenges into clear, reliable, and scalable systems.
Stand Out From the Crowd
Upload your resume and get instant feedback on how well it matches this job.
Job Type
Full-time
Career Level
Senior
Education Level
No Education Listed