Microsoft Azure’s Artificial Intelligence and High-Performance Computing (AI/HPC) organization powers some of the world’s largest cloud-native supercomputers used for frontier AI training, scientific computing, and large-scale distributed simulations. Our team builds and operates hyperscale GPU clusters that consistently place Azure among global leaders in the Top500, MLPerf, and Graph500 benchmarks. By joining us, you step into the engineering core responsible for ensuring these systems remain reliable, performant, and ready for the next wave of AI innovation.
At this supercomputing scale, reliability and operational excellence are engineering challenges of their own. As a Senior Supercomputing Operations Engineer, you will own day-to-day operations of InfiniBand and GPU interconnect fabrics, treating them as a single, mission-critical reliability domain that directly impacts GPU availability, training throughput, and customer SLAs. You will lead incident triage and mitigation, debug complex fabric-layer failures, and correlate telemetry across nodes, switches, SM behavior, and GPU subsystems to identify true root causes. Your work will focus on resolving real production incidents at scale, improving operational readiness, and preventing recurrence through better tooling, automation, and deep systems understanding.
You will build and use state-of-the-art tools to detect issues proactively, close operational gaps, and improve observability across our fabrics. You will contribute to TSGs, operational playbooks, and escalation guides while partnering with internal engineering teams and industry-leading manufacturers to drive meaningful fixes. The solutions you develop and the operational improvements you drive will improve the reliability of Azure’s largest supercomputing deployments and directly support the most compute-intensive AI workloads running in the cloud.
Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.
Job Type
Full-time
Career Level
Mid Level