Microsoft Azure’s Artificial Intelligence and High Performance Computing (AI/HPC) organization powers some of the world’s largest cloud native supercomputers used for frontier AI training, scientific computing, and large scale distributed simulations. Our team builds and operates hyperscale GPU clusters that consistently place Azure among global leaders in the Top500, MLPerf, and Graph500 benchmarks. By joining us, you step into the engineering core responsible for ensuring these systems remain reliable, performant, and ready for the next wave of AI innovation. At this scale, interconnect fabrics are a first order reliability system that directly determines GPU availability, training throughput, and customer SLAs. As a Principal Supercomputing Operations Engineer, you serve as the technical authority and strategic owner for interconnect fabric operations across flagship AI supercomputing environments. You treat InfiniBand and GPU interconnect fabrics as a single end to end reliability domain, defining how they are operated, debugged, hardened, and scaled in production. This is a hands on, production first leadership role operating at the intersection of architecture, live operations, and reliability engineering. You will lead the most complex and impactful fabric related incidents, making high stakes technical decisions under ambiguity while balancing availability, risk, long term correctness, and customer impact. Beyond resolving incidents, you define failure models, operational strategy, and systemic prevention mechanisms that reduce recurrence at fleet scale. Your impact multiplies through technical leadership: setting operational standards, influencing engineering direction across teams, mentoring senior engineers, and partnering deeply with platform, hardware, firmware, and service teams to drive durable reliability improvements. You will architect and drive automation, diagnostics, and telemetry that materially improve operability and debuggability of interconnect fabrics, and author authoritative playbooks, TSGs, and escalation models relied on across the organization. Through your judgment, designs, and operational strategy, Azure’s largest AI platforms scale safely, predictably, and sustainably to meet the demands of next generation AI workloads. Microsoft’s mission is to empower every person and organization on the planet to achieve more. We work with a growth mindset, innovate to empower others, and collaborate to realize shared goals. Our culture is rooted in respect, integrity, and accountability, and we strive to build an environment where every engineer can learn, grow, and have real impact. As part of this team, you’ll help shape the next generation of cloud scale AI infrastructure and contribute to an inclusive culture where your expertise makes a difference every day.
Stand Out From the Crowd
Upload your resume and get instant feedback on how well it matches this job.
Job Type
Full-time
Career Level
Principal