Microsoft Azure operates one of the world’s largest and most complex cloud compute fleets. As a Principal Technical Program Manager (TPM) in Compute Fleet Infrastructure, you will lead cross‑functional initiatives that ensure node‑level health, availability, and automated recovery across Azure’s global fleet, directly supporting the reliability and stability of customer workloads at scale.

This role operates at the intersection of hardware, host operating system (OS), virtualization, control plane services, and data center operations. The mission is to transform low‑level node health signals into predictable, automated, and scalable recovery outcomes, protecting customer workloads while continuously raising the reliability standards of the Azure platform.

You will own end‑to‑end programs that span health signal definition, fleet‑wide detection, mitigation strategies, and recovery automation. This work involves close collaboration with engineering, hardware, site reliability engineering (SRE), and operations teams to drive coordinated execution and measurable improvements across the compute fleet.

Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees, we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.
Job Type
Full-time
Career Level
Principal