About The Position

We are seeking a Staff Technical Program Manager (TPM) to lead AV ML Infrastructure programs for our autonomous driving platform. In this role, you will drive strategy and execution for the large-scale ML infrastructure that powers next-generation autonomy models, including training pipelines, model lifecycle management, compute orchestration, and operational reliability. You will operate at the intersection of ML engineering, platform infrastructure, and operations, ensuring our ML systems are scalable, efficient, and production-ready to support end-to-end model development.

Requirements

  • 10+ years of technical program management experience, including leadership of large, complex, multi-disciplinary programs.
  • 5+ years working in ML Operations, ML infrastructure, AI platform engineering, or distributed compute environments.
  • BS or MS in Engineering, Computer Science, or a related technical field.
  • Experience supporting large-scale machine learning training or AI infrastructure programs, including compute orchestration, pipeline reliability, and resource management.
  • Proven track record of managing large, complex, cross-functional programs involving infrastructure, software systems, and data pipelines with ambiguous or evolving requirements.
  • Ability to analyze system performance metrics, identify bottlenecks, and translate insights into program-level improvements.
  • Exceptional communication, collaboration, and stakeholder management skills.
  • Deep familiarity with Agile program delivery, task management tools (e.g., Jira), reporting tools, and technical development tooling.

Nice To Haves

  • Experience with GPU compute management, cluster orchestration (e.g., Kubernetes, Slurm), or cloud infrastructure (GCP, AWS).
  • Familiarity with ML workflow orchestration tools (e.g., Kubeflow, Airflow, or similar).
  • Background in SRE, platform engineering, or DevOps practices applied to ML systems.
  • Experience with observability, SLO/SLI frameworks, and incident management for production ML platforms.

Responsibilities

  • Lead end-to-end strategic planning and execution for AI/ML Infrastructure programs, delivering measurable improvements in training throughput, platform reliability, and model development velocity.
  • Establish clear program objectives, milestones, and success metrics to drive predictable, high-quality delivery across multiple engineering and operations teams.
  • Collaborate with AI/ML engineering, platform, validation, and product teams to define requirements, prioritize initiatives, and deliver solutions that improve AI development cycle performance and operational efficiency.
  • Translate complex MLOps needs — from distributed training orchestration to compute resource management and pipeline scaling — into actionable multi-team execution plans with defined owners and measurable outcomes.
  • Align long-term technical roadmaps with organizational goals, ensuring ML infrastructure evolves to support increasing model complexity, dataset scale, and training workloads.
  • Identify technical, operational, and program risks early; develop mitigation strategies that protect training timelines, platform stability, and service reliability.
  • Ensure AI/ML operations processes and infrastructure are designed for long-term scalability, performance, and operational excellence, including monitoring, incident response, and capacity planning.
  • Define KPIs for ML platform performance, training system reliability, model training cycle time, and delivery velocity; maintain transparent dashboards and executive-ready reporting.
  • Provide leadership with clear insights into progress, tradeoffs, and program health to support timely decision-making.

Benefits

  • From day one, we're looking out for your well-being, at work and at home, so you can focus on realizing your ambitions. Visit the Total Rewards resources to learn how GM supports a career that rewards you personally.