About The Position

We are seeking a highly motivated and experienced Infrastructure and Node Delivery Technical Program Manager (TPM) to join our dynamic team focused on New Product Introduction (NPI) for our next-generation GPU hardware provisioning and delivery. In this pivotal role, you will be instrumental in leading cross-functional efforts from concept to mass production, ensuring the timely, high-quality, and cost-effective delivery of our innovative GPU products that power the future of AI. You will navigate the complexities of hardware development cycles, collaborate with world-class engineers and external vendors, and influence strategic decisions to bring groundbreaking technology to market.

The Technical Program Manager will:

  • Production Readiness: Ensure infrastructure and system software are production-ready for new hardware and compute platforms. Engage in technical discussions with engineering teams, challenge assumptions, and contribute to problem-solving.
  • Program Leadership: Drive end-to-end programs spanning GPU provisioning, at-scale deployments, Fleet NPI readiness, and vendor management. Anticipate and identify potential risks, proactively develop mitigation strategies, and drive timely resolution of technical and logistical challenges.
  • Reliability & SLA Management: Coordinate with hardware compute engineering, Fleet teams, and external vendors to maintain service reliability, enforce SLAs, and lead incident response efforts.
  • Observability & Telemetry: Partner with engineering teams to improve monitoring, telemetry, and fleet observability for proactive performance management.
  • Metrics & Insights: Define and track metrics around GPU fleet health, performance, and reliability.
  • Postmortems & Continuous Improvement: Run post-incident reviews and drive action items that enhance system reliability and prevent regressions.
  • Internal Enablement: Collaborate with internal customers to collect feedback, enable adoption of core infrastructure platforms, and refine onboarding experiences (e.g., K8s Core Interface, CKS, SUNK) for hardware compute NPIs.
  • Cross-functional Coordination: Work closely with Product, Infrastructure, Platform Engineering, Vendor, and Customer Experiences to align on roadmap priorities and customer delivery timelines.
  • Effective Communication: Communicate program status, risks, and critical decisions to senior leadership and executive stakeholders with clarity and conciseness. Foster a culture of transparency, collaboration, and continuous improvement within the NPI process.

Requirements

  • Bachelor's degree in Electrical Engineering, Computer Engineering, or a related technical field.
  • 10+ years of experience in technical program management in GPU provisioning, fleet management, or large-scale compute infrastructure.
  • Background in observability, monitoring, or telemetry systems (e.g., Prometheus, Grafana, OpenTelemetry).
  • Hands-on experience coordinating NPI or GTM readiness for compute products.
  • Technical understanding of system software orchestration and hardware/software integration.
  • Solid understanding of hardware and fleet development lifecycles.
  • Proven ability to lead cross-functional teams, influence without direct authority, and drive consensus in a fast-paced environment.
  • Exceptional communication, interpersonal, and presentation skills.
  • Proficiency in program management tools (e.g., Jira, Confluence, Sheets).

Nice To Haves

  • Master's degree in Engineering or an MBA.
  • Experience with GPU or other high-performance compute architecture NPI.
  • Experience working with international manufacturing partners and supply chains.
  • Experience with agile methodologies in a hardware and software development context.

Benefits

  • Medical, dental, and vision insurance - 100% paid for by CoreWeave
  • Company-paid Life Insurance
  • Voluntary supplemental life insurance
  • Short- and long-term disability insurance
  • Flexible Spending Account
  • Health Savings Account
  • Tuition Reimbursement
  • Ability to Participate in Employee Stock Purchase Program (ESPP)
  • Mental Wellness Benefits through Spring Health
  • Family-Forming support provided by Carrot
  • Paid Parental Leave
  • Flexible, full-service childcare support with Kinside
  • 401(k) with a generous employer match
  • Flexible PTO
  • Catered lunch each day in our office and data center locations
  • A casual work environment
  • A work culture focused on innovative disruption

What This Job Offers

  • Job Type: Full-time
  • Career Level: Mid Level
  • Number of Employees: 501-1,000 employees
