Senior Principal Engineering Manager

Microsoft, Redmond, WA

About The Position

Microsoft Research (MSR) is working to transform the future of artificial intelligence (AI) by bridging the gap between cutting-edge general AI and the specialized, real-world applications that drive meaningful impact. To pursue this mission, we're building world-class AI infrastructure that not only powers our models on large Graphics Processing Unit (GPU) clusters, but also accelerates our research lifecycle through agentic development. Our team has a global scope, powering the work of every Microsoft Research lab around the world.

We're looking for a Senior Principal Engineering Manager to lead and grow our team that builds one of the world's largest research GPU clusters. This is a transformational leadership opportunity. You will grow a talented team of engineers and evolve it into a cohesive, high-performing organization that designs, builds, and operates world-class research compute infrastructure at scale. You will set the vision for how the team works, grows, and delivers, while driving the execution rigor needed to ship complex infrastructure reliably in a highly dynamic environment.

If you're passionate about leading teams at the frontier of AI infrastructure and want to shape the future of how research compute is built and operated, we invite you to explore this opportunity.

At Microsoft, our mission, to empower every person and every organization on the planet to achieve more, guides how we partner with customers to deliver trusted, impactful solutions. With a growth mindset culture, we innovate responsibly and measure success by shared progress across people, teams, and customers. Join us to do meaningful work that changes the world and helps shape what’s next for everyone.

Requirements

  • Bachelor's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.

Nice To Haves

  • 5+ years of people management experience leading software engineering teams, including managing principal engineers.
  • Experience building or operating infrastructure for large-scale distributed systems, cloud platforms, or artificial intelligence (AI)/machine learning (ML) workloads.
  • Track record of driving execution on complex, multi-workstream infrastructure projects with clear milestones and accountability.
  • Technical fluency in one or more of: large-scale compute clusters, GPU infrastructure, scheduling and orchestration (Kubernetes, Volcano), or high-performance computing (HPC) environments.
  • Experience with GPU programming (CUDA, NCCL) and frameworks such as PyTorch.
  • Expertise in networking (InfiniBand, NVLink), storage systems, or parallelism strategies for distributed training.
  • A track record of strong cross-functional partnerships, including the ability to align on strategic direction, deliver joint accountabilities, and develop relationships with staff members with widely varied expertise.
  • Experience scaling engineering teams through significant growth phases (hiring, onboarding, and integrating new engineers into a high-performing team).
  • Master's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 15+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.

Responsibilities

  • Lead, mentor, and grow the engineering team that builds MSR’s AI research infrastructure.
  • Recruit and develop exceptional engineering talent and build a diverse team, covering hiring, onboarding, career development, and performance management.
  • Drive execution across the team by setting clear goals, tracking milestones, managing dependencies, and ensuring accountability for delivering complex infrastructure projects on time and at high quality.
  • Lead team culture and process changes, cultivating an AI-first mentality that accelerates our progress through agentic coding, automation, and skills development.
  • Provide technical vision and judgment on the team's architecture, strategy, and roadmap (spanning supercomputer GPU clusters, high-performance networking, workload optimization, researcher tools, and agentic workflows) while empowering engineers to own deep technical details.
  • Collaborate closely across disciplines with engineers, program managers, and research and science teams to align priorities, resolve dependencies, and build better solutions together.
  • Foster a team culture of operational excellence, continuous improvement, and high psychological safety where engineers are empowered to take ownership and innovate.