Head of Data Center Rack and Cluster

OpenAI•San Francisco, CA

3d•$391,000 - $456,000

About The Position

The Industrial Compute team at OpenAI is responsible for all things compute capacity, from our partnership deals (e.g., Stargate) through the delivery and operation of that capacity. It is a fast-paced, dynamic team working at the cutting edge of capacity in scale, speed, quality, and cost, while enabling the best frontier AI models and beyond.

This role leads the engineering team that defines the rack and system architectures underpinning the compute infrastructure behind OpenAI’s frontier AI models. The role oversees the early phases of compute definition, from architecture through delivery of production-ready racks. We are looking for an experienced engineering leader who understands how to navigate the technical and business tradeoffs required to build the world’s premier AI infrastructure. Qualified candidates will have extensive hands-on experience in system bring-up and a proven ability to bring platforms to a stable, production-ready state through lab and early test and deployment environments.

Requirements

Hyperscale data center experience (or equivalent)
Deep experience in rack, system, or network architecture definition
Ability to use performance and TCO modeling, together with a deep understanding of business risks and requirements, to guide complex and often ambiguous system tradeoffs
Experience overseeing delivery of complex new hardware platforms and shepherding systems from first delivery to a production-ready state
Ability to manage vendor relationships to resolve rack and system issues and to shape vendor roadmaps to support OpenAI’s needs
Excellent management and leadership skills

Responsibilities

Own the reference rack, cluster, and system architecture standards for new OpenAI compute platforms.
Define readiness and acceptance criteria for production-bound systems.
Stay engaged through validation until configurations are proven repeatable and ready for handoff.
Manage relationships with accelerator and equipment vendors to define an overall roadmap.
Partner across the Industrial Compute and partner teams to bring clarity to requirements and ensure smooth delivery of next-generation systems.
Manage a team of engineering leads focused on the architectural and engineering work required to define, test and stabilize OpenAI’s future compute platforms.
