Technical Program Manager, AI Infrastructure

Cerebras Systems, Sunnyvale, CA

About The Position

Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs. Our novel wafer-scale architecture provides the AI compute power of dozens of GPUs on a single chip, with the programming simplicity of a single device. This approach allows Cerebras to deliver industry-leading training and inference speeds and empowers machine learning users to run large-scale ML applications effortlessly, without the hassle of managing hundreds of GPUs or TPUs.

Cerebras' current customers include top model labs, global enterprises, and cutting-edge AI-native startups. OpenAI recently announced a multi-year partnership with Cerebras to deploy 750 megawatts of capacity, transforming key workloads with ultra-high-speed inference. Thanks to its groundbreaking wafer-scale architecture, Cerebras Inference offers the fastest generative AI inference solution in the world, over 10 times faster than GPU-based hyperscale cloud inference services. This order-of-magnitude increase in speed is transforming the user experience of AI applications, unlocking real-time iteration and increasing intelligence through additional agentic computation.

About The Role

Be part of the team that builds and operates the world's fastest AI infrastructure for training and inference. As a TPM, you will help accelerate data center buildouts to meet the explosive demand for our inference service platform.

Requirements

  • Experience leading large, cross-functional infrastructure programs.
  • Experience with AI/ML, HPC, or accelerator-based infrastructure.
  • Strong understanding of data center power and cooling fundamentals.
  • Experience installing and managing network, storage, and compute devices.
  • Proven ability to define and operationalize metrics.
  • Strong written and executive-level communication skills.
  • Experience working with colocation providers and facilities teams.
  • Background in incident management, reliability, or service operations.

Nice To Haves

  • Experience running network operations teams.

Responsibilities

  • Own end-to-end technical programs for multiple data center buildouts, coordinating with partners, contractors, and internal teams.
  • Drive facility site readiness for power and cooling for Cerebras Wafer-Scale Engine systems.
  • Coordinate equipment delivery and manage vendor accountability for schedules and quality related to rack integration and inter-rack cabling.
  • Act as the single-threaded owner across internal partners: Hardware & Systems Engineering, Network & Storage Engineering, AI Cloud Infrastructure & Operations.
  • Enforce handover criteria between site completion, equipment deployment, and operations.
  • Own overall schedule tracking, risk identification, and mitigation, creating clear visibility for leadership.
  • Establish program governance, risk tracking, and RACI clarity.
  • Present program status, metrics, and operational risks to senior leadership.
  • Drive partner accountability on contractual milestones and commercial commitments.
  • Document repeatable processes and implement them to scale across future data centers.
  • Partner on installation, commissioning, change management, and break/fix workflows.
  • Lead incident reviews and postmortems, ensuring corrective actions are completed.