Staff Technical Program Manager, AI Infrastructure

GM • Sunnyvale, CA

About The Position

We are seeking a Staff Technical Program Manager (TPM) to lead AV ML Infrastructure programs for our autonomous driving platform. In this role, you will drive strategy and execution for the large-scale ML infrastructure (training pipelines, model lifecycle management, compute orchestration, and operational reliability) that powers next-generation autonomy models. You will operate at the intersection of ML engineering, platform infrastructure, and operations, ensuring our ML systems are scalable, efficient, and production-ready to support end-to-end model development.

Requirements

10+ years of technical program management experience, including leadership of large, complex, multi-disciplinary programs.
5+ years working in ML Operations, ML infrastructure, AI platform engineering, or distributed compute environments.
BS or MS in Engineering, Computer Science, or a related technical field.
Experience supporting large-scale machine learning training or AI infrastructure programs, including compute orchestration, pipeline reliability, and resource management.
Proven track record of managing large, complex, cross-functional programs involving infrastructure, software systems, and data pipelines with ambiguous or evolving requirements.
Ability to analyze system performance metrics, identify bottlenecks, and translate insights into program-level improvements.
Exceptional communication, collaboration, and stakeholder management skills.
Deep familiarity with Agile program delivery, task management tools (e.g., Jira), reporting tools, and technical development tooling.

Nice To Haves

Experience with GPU compute management, cluster orchestration (e.g., Kubernetes, Slurm), or cloud infrastructure (e.g., GCP, AWS).
Familiarity with ML workflow orchestration tools (e.g., Kubeflow, Airflow, or similar).
Background in SRE, platform engineering, or DevOps practices applied to ML systems.
Experience with observability, SLO/SLI frameworks, and incident management for production ML platforms.

Responsibilities

Lead end-to-end strategic planning and execution for AI/ML Infrastructure programs, delivering measurable improvements in training throughput, platform reliability, and model development velocity.
Establish clear program objectives, milestones, and success metrics to drive predictable, high-quality delivery across multiple engineering and operations teams.
Collaborate with AI/ML engineering, platform, validation, and product teams to define requirements, prioritize initiatives, and deliver solutions that improve AI development cycle performance and operational efficiency.
Translate complex MLOps needs — from distributed training orchestration to compute resource management and pipeline scaling — into actionable multi-team execution plans with defined owners and measurable outcomes.
Align long-term technical roadmaps with organizational goals, ensuring ML infrastructure evolves to support increasing model complexity, dataset scale, and training workloads.
Identify technical, operational, and program risks early; develop mitigation strategies that protect training timelines, platform stability, and service reliability.
Ensure AI/ML operations processes and infrastructure are designed for long-term scalability, performance, and operational excellence, including monitoring, incident response, and capacity planning.
Define KPIs for ML platform performance, training system reliability, model training cycle time, and delivery velocity; maintain transparent dashboards and executive-ready reporting.
Provide leadership with clear insights into progress, tradeoffs, and program health to support timely decision-making.

Benefits

From day one, we're looking out for your well-being, at work and at home, so you can focus on realizing your ambitions. Learn how GM supports a career that rewards you personally by visiting Total Rewards resources.
