Principal Product Manager

NVIDIA
Santa Clara, CA
$240,000 - $379,500
Hybrid

About The Position

NVIDIA is driving a vision for AI factories that convert tokens to intelligence at scale to power the AI demands of tomorrow. Maintaining AI infrastructure at scale takes more than human involvement; it demands smart automation. The orchestration engine for AI factory break-fix runs live in production at DGX Cloud. As the Product Manager leading all aspects of resilient automation for AI factories, you will own break-fix automation. You will develop the product strategy, improve the operator experience, and guide the roadmap for professionals. You will build a scalable, reliable product on a strong engineering foundation that NVIDIA Cloud Partners depend on to uphold their SLAs. This is your chance to shape how AI factories self-heal!

Requirements

  • 15+ years of product management experience in infrastructure, platform, or MLOps areas, or equivalent background.
  • BS or MS in Computer Science, Engineering, or a related technical area, or equivalent experience.
  • Demonstrated expertise with distributed systems, workflow orchestration, and the safety tradeoffs inherent in automation.
  • Track record owning products with real-world operational consequences — you understand blast radius and build accordingly.
  • Strong operator UX instincts — proven ability to translate complex system state into workflows that on-call engineers can act on under pressure.
  • Ability to build alignment across engineering, SRE, and external vendor partner teams.

Nice To Haves

  • Hands-on experience with GPU infrastructure, datacenter operations, or AI factory environments.
  • Experience with RMA logistics, vendor SLA oversight, and hardware repair processes at scale.
  • Background in reliability engineering, SLO design, or chaos/fault-injection testing.
  • Prior experience at a cloud service provider or on a hyperscaler infrastructure team.
  • Experience building agentic AI workflow software.

Responsibilities

  • Take full responsibility for the strategic direction and roadmap of the break-fix automation system spanning multiple vendors, technologies, and CSPs.
  • Define automation confidence thresholds, blocking issue criteria, and human-in-the-loop intervention points that balance speed with operational safety.
  • Build the operator UX for repair queues, workflow transparency, and audit trails — ensuring on-call engineers have the context they need to act quickly and confidently.
  • Drive the integration between failure attribution and automated repair actions, following through from detection to resolution.
  • Define repair SLOs and own the metrics framework for time-to-drain, time-to-healthy, and overall fleet availability.
  • Collaborate with NCP operators, SRE teams, and hardware vendor partners to integrate RMA processes and optimize repair workflows at scale.

Benefits

  • Highly competitive salaries
  • Comprehensive benefits package
  • Equity