Why Reliability?

Roblox serves over 100 million people every day across a platform that is constantly evolving, and behind every experience is infrastructure that has to work, every time, at massive scale. The Reliability team at Roblox operates across the full depth and breadth of the Roblox stack. Availability of the platform is a key company goal.

We are hiring the first Principal Machine Learning Engineer on our team. As a Principal Machine Learning Engineer within Reliability, you will set the 3-5 year technical strategy and architectural blueprint for how machine learning systems and practices can improve the reliability of the overall Roblox platform. You will own the architectural and execution roadmap for using massive data (logs, traces, metrics, and production changes) to detect issues before they become real problems (MTTD) and/or reduce the time to resolve incidents (MTTR). You will have the opportunity to collaborate cross-functionally with similar teams at Roblox to define best practices and software.

You will:

Define and Own the Technical Vision: Define and lead the multi-year technical vision, architectural strategy, and execution for machine learning solutions in Content Safety, ensuring these systems proactively and effectively detect and mitigate violative content at massive scale.

Strategic Stakeholder Partnership: Collaborate with executive-level Product, Data Science, Policy, and Operations leaders to define and prioritize the strategic machine learning roadmap, influencing product strategy and demonstrating the impact of ML on user trust and safety outcomes.

Lead Innovation: Oversee the adoption and safe deployment of innovative machine learning techniques (e.g., transfer learning, self-supervised learning, quantization, LoRA, distillation).

Drive End-to-End Product Development: You will not just model; you will build. You will work cross-functionally to construct datasets from scratch where none exist, build auto-labeling pipelines, and ship solutions to novel technical problems.

Ship Code, Not Just Models: Expect to spend roughly 30-40% of your time on backend and integration work. You will be responsible for integrating your work into the production stack, using modern AI coding tools (e.g., Cursor) to accelerate velocity and handle infrastructure complexity.
Job Type: Full-time
Career Level: Principal
Education Level: No Education Listed