Principal Software Engineer, Compute Provisioning

Roblox
San Mateo, CA
Hybrid

About The Position

Every day, tens of millions of people come to Roblox to explore, create, play, learn, and connect with friends in 3D immersive digital experiences, all created by our global community of developers and creators. At Roblox, we’re building the tools and platform that empower our community to bring any experience that they can imagine to life. Our vision is to reimagine the way people come together, from anywhere in the world, and on any device. We’re on a mission to connect a billion people with optimism and civility, and we’re looking for amazing talent to help us get there. A career at Roblox means you’ll be working to shape the future of human interaction, solving unique technical challenges at scale, and helping to create safer, more civil shared experiences for everyone.

As a Principal Software Engineer on the Fleet Management team, you will lead the systems that provision and rebuild Roblox's global fleet across bare metal and cloud. This team owns provisioning and MAPI, the global Machine API that turns raw capacity into production-ready infrastructure in minutes across hundreds of thousands of machines in on-prem and cloud environments, including new GPU and AI infrastructure. You will shape the technical direction for this critical compute platform, unify diverse hardware and environment-specific workflows behind MAPI, and drive large-scale maintenance operations such as firmware updates and hardware tuning.

Requirements

  • 8+ years of experience with strong expertise in distributed systems and infrastructure.
  • Bachelor's degree in computer science or an equivalent field.
  • Strong proficiency in Go, C/C++, Rust, or other systems-level programming languages.
  • Experience building and operating large-scale distributed systems that other engineering teams depend on.

Nice To Haves

  • Familiarity with bare-metal concepts (PXE/iPXE, DHCP, BMC/IPMI/Redfish, OS imaging) is a plus; deep low-level systems experience is a bonus, not a requirement.
  • Interest in modern server hardware including GPU servers, AI accelerators, and cloud infrastructure.
  • A track record of building high-performance automation at fleet scale and reducing toil through developer-friendly APIs.

Responsibilities

  • Lead the Machine Bootstrap pod in building and evolving provisioning and fleet management at massive scale.
  • Architect and extend MAPI, the unified Machine API that abstracts bare-metal, GPU hosts, and cloud instances behind a single global interface.
  • Ship fleet-wide maintenance operations (BIOS updates, firmware updates, configuration changes) to hundreds of thousands of machines through MAPI.
  • Drive best-in-class provisioning performance, with a full machine rebuild from scratch taking only minutes.
  • Evaluate and integrate new hardware platforms including GPU servers and AI accelerators into the provisioning pipeline.
  • Collaborate across Compute, Networking, and Cloud teams on the full machine lifecycle from rack-and-stack to production.