OpenAI-posted 3 months ago
San Francisco, CA
1,001-5,000 employees

OpenAI’s Capacity Planning team ensures that our research and product teams have the compute, storage, and networking resources they need, when they need them. We work across engineering, product, and research to forecast demand, track supply, and optimize utilization of compute. Our goal is to develop data-driven, automated, and scalable planning systems that unlock the next generation of frontier AI models.

We are looking for a Capacity Tooling Engineer to design, build, and maintain the internal platforms, services, and dashboards that power OpenAI’s capacity planning and allocation processes. You will create the tooling that helps us forecast usage, model scenarios, and make multi-billion-dollar infrastructure decisions. Your work will directly impact how we allocate compute across research, product launches, and strategic initiatives.

This role is based in San Francisco, CA. We use a hybrid work model of 3 days in the office per week and offer relocation assistance to new employees.

  • Build and scale tooling for capacity planning that incorporates data pipelines, forecasting dashboards, allocation solvers, and scenario modeling tools.
  • Integrate data sources from infrastructure teams, data science, and multiple cloud providers to create a single source of truth for compute supply, demand, and costs.
  • Develop real-time reporting and alerting to surface supply gaps, utilization trends, and risks to leadership.
  • Design and implement automations to streamline workflows such as demand collection and supply allocation.
  • Design and implement optimization engines and solvers that recommend optimal allocation of compute.
  • Build interactive models that allow leadership to test 'what-if' scenarios (e.g., varying levels of user growth, price changes, or new product launches).
  • Depth and expertise in one or more of the following areas: GPU, CPU, Storage, Networking.
  • Experience in AI/ML and/or cloud infrastructure.
  • Ability to make complex decisions with significant engineering, commercial, product, and research implications, often with many billions of dollars involved.
  • Ability to thrive in ambiguity and work on a lean team as a self-starter.
  • Excitement about building infrastructure at an incredible scale.
  • Ability to move fast, make decisions, and be held accountable.
  • Ability to wear multiple hats and juggle technical, business, and engineering considerations.
  • Relocation assistance to new employees.
  • Hybrid work model of 3 days in the office per week.