Principal Technical Program Manager

Microsoft
$163,000 - $296,400

About The Position

Microsoft Azure operates one of the world’s largest and most complex cloud compute fleets. As a Principal Technical Program Manager (TPM) in Compute Fleet Infrastructure, you will lead cross‑functional initiatives that ensure node‑level health, availability, and automated recovery across Azure’s global fleet, directly supporting the reliability and stability of customer workloads at scale. This role operates at the intersection of hardware, host operating system (OS), virtualization, control plane services, and data center operations. The mission is to transform low‑level node health signals into predictable, automated, and scalable recovery outcomes, protecting customer workloads while continuously raising the reliability standards of the Azure platform.

You will own end‑to‑end programs that span health signal definition, fleet‑wide detection, mitigation strategies, and recovery automation. This work involves close collaboration with engineering, hardware, site reliability engineering (SRE), and operations teams to drive coordinated execution and measurable improvements across the compute fleet.

Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

Requirements

  • Bachelor's Degree AND 8+ years experience in engineering, product/technical program management, data analysis, or product development OR equivalent experience.
  • 6+ years of experience managing cross-functional and/or cross-team projects.
  • Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screening: Microsoft Cloud Background Check. This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.

Nice To Haves

  • Bachelor's Degree AND 15+ years experience in engineering, product/technical program management, data analysis, or product development OR equivalent experience.
  • 10+ years of experience managing cross-functional and/or cross-team projects.
  • 1+ year(s) of experience reading and/or writing code (e.g., sample documentation, product demos).

Responsibilities

  • Own node health strategy across Azure compute fleets, including bare metal and virtualized environments.
  • Define what “healthy” means at the node level, aligning hardware, firmware, host OS, and virtualization signals into a consistent fleet health model.
  • Drive measurable improvements in node availability, repair success rates, and recovery times across regions and SKUs.
  • Automated Detection, Mitigation, and Recovery: Lead programs that detect unhealthy nodes early, prevent customer impact, and automate recovery actions (e.g., repair, reprovisioning, isolation, or migration).
  • Partner with engineering teams to close gaps between signal detection and actionable remediation.
  • Ensure recovery mechanisms scale safely across large‑scale, heterogeneous fleets.
  • Cross‑Team Program Execution: Coordinate work across multiple organizations, including compute platform engineering, hardware systems, data center operations, and site reliability teams.
  • Translate ambiguous reliability problems into clear program plans, milestones, and success metrics.
  • Identify systemic issues and drive long‑term fixes rather than repeated tactical mitigations.
  • Metrics, Insights, and Continuous Improvement: Define and track fleet‑level health KPIs (e.g., nodes in service, recovery success, time‑to‑repair).
  • Use data and post‑incident learnings to prioritize investments that reduce repeat failures.
  • Represent node health and recovery readiness in executive and operational reviews.