Governance & Operations Lead, Infrastructure & Planning

Apple • Cupertino, CA

About The Position

Apple’s Platform Acceleration & Compute Efficiency (PACE) is a high-leverage team operating at the critical intersection of our ML organizations, underlying compute infrastructure, and core platform tooling. Our mission is to empower Apple’s software engineering teams with efficient, scalable compute. By driving out operational friction and optimizing the broader machine learning ecosystem, we directly accelerate the pace of development across the company.

We are seeking a founding Operations Lead with a passion for ML compute lifecycle management to build and own our governance and operations function. This is a ground-up opportunity to develop the roadmap and build a world-class, high-level operations team. You will partner closely with tools and analytics leads to define the governance systems, telemetry frameworks, utilization dashboards, and analytical models that give PACE and Apple’s ML leadership a clear, continuously updated picture of how compute is being used.

Success for this role means zero project slips due to process, platform, or mis-prioritization. Mastery of this role means establishing and tracking key productivity metrics and proactively solving problems to keep those metrics consistently high. Your work will power core ML analytics for both internal development and inference serving.

Most operations roles involve endless runbooks. This role is for someone who is also an engineer and artist at heart. Your operations work will empower a high-leverage team to drive decisions that have a significant impact on Apple’s financial results.

Requirements

BS in Computer Science, Data Science, Computer Engineering, or equivalent practical experience
5+ years in governance/operations, data engineering, analytics engineering, or technical program management, or in a large-scale compute or cloud environment
Organized, process-oriented, and comfortable owning operational systems other people depend on daily
Strong cross-functional experience working with capable engineers, managers, EPMs, and leaders
Proven experience designing and operating complex systems and processes from the ground up
AI-fluent and able to adapt quickly to AI-enabled workflows
Direct experience managing SRE and hierarchical technical support systems
Proficiency in SQL and experience building analytical dashboards or data products (Tableau, Looker, Grafana, or similar)
Experience designing data models or telemetry schemas for infrastructure, capacity, or utilization data
Ability to translate raw technical metrics into clear business narratives for both engineers and executives

Nice To Haves

Experience with Python for data analysis (pandas, notebooks) or lightweight pipeline development
Familiarity with ML training infrastructure concepts: an understanding of what GPU utilization, training throughput, and scheduling efficiency mean, even if you have not optimized them directly
Prior experience in FinOps, capacity planning, cloud cost management, or IT governance
Experience building or operating data analytics systems
Background in automated alerting or anomaly detection for infrastructure metrics

Responsibilities

Own the daily operations of the systems you architect. You will design and oversee a scalable hub-and-spoke support model, spanning cross-functional tier-1 on-call teams, tier-2 team leads, and a dedicated tier-3 engineering escalation group that you will build and manage.
Own and evolve PACE's governance tooling and related systems, ensuring that compute resource requests, allocations, and utilization data are accurately captured to support rapid, at-scale analysis.
Bridge coverage gaps as Apple's ML ecosystem expands to new hardware (GPUs, TPUs, and custom silicon) and workloads (inference, on-device), balancing power, performance, cost, and compatibility.
Partner with the Data & Analytics Lead to maintain the analytical layer, building the dashboards, reports, and automated alerts that surface efficiency opportunities and track infrastructure savings.
Identify system anomalies and operational bottlenecks that degrade utilization and drive up costs, building financial impact models that translate technical metrics into actionable insights for leadership.
Partner with Apple's ML engineering teams, delivering data-driven analytics to optimize the foundation models, inference workloads, and platform tooling that rely on your data for success.
Design robust governance processes and automated operations engineered specifically to meet Apple-scale ML demands.
Partner to produce strategic analyses that inform executive decisions on ML compute investment, allocation, and strategy, directly influencing Apple's ML growth and feature development.
