About The Position

ML Platform powers hundreds of use cases and billions of inferences per day across Discovery, Safety, Economy, and Creation. We build the primitives that let teams train, evaluate, deploy, and operate models quickly and safely, so a new ML idea can reach production in weeks or less. We're looking for a Principal Platform Engineer who treats platform as a product: someone who can turn complex ML/AI infrastructure into clear, durable APIs and easy-to-use CLIs/UIs that our internal developers love. This role blends product thinking, developer experience, backend engineering, and infrastructure at scale. Hands-on ML experience is a plus; a track record of building internal platforms that developers love is a must.

Requirements

  • 7+ years of professional experience, with a wealth of system design experience to draw on in building a scalable, reliable ML platform for all of Roblox.
  • Proficiency in API design and developer experience: gRPC/REST APIs, SDKs, CLIs, and simple UIs that developers love to use.
  • Experience with the end-to-end ML model lifecycle, including model serving, training, model CI/CD, and GPU resource management, and a track record of building ML platform features that are delightful to use.
  • Bachelor's degree in Computer Science, Computer Engineering, Data Science, or a similar technical field.

Responsibilities

  • Own platform as a product and set direction end to end: Define requirements, write RFDs, and ship APIs, SDKs, CLIs, and UIs that make ML@Roblox easy to adopt.
  • Bootstrap and maintain core ML Platform components: Serving Layer, Model Registry, Pipeline Orchestrator, and Training/Inference control planes.
  • Set technical strategy and oversee development of high-scale, reliable infrastructure systems, with clear SLOs for latency, availability, and cost.
  • Design great developer experiences with paved-road templates, golden paths, opinionated defaults, and clear docs to reduce time-to-first-production.
  • Instrument the platform to measure adoption, friction, reliability, and cost; use data to prioritize the roadmap and validate outcomes.
  • Partner across organizations (ML Engineering, Data Science, Infra/SRE, Security, Finance) to optimize performance, safety, and spend, especially for GPU-intensive training and high-QPS inference.
  • Propose and implement new platform tooling to improve time to production for MLEs across the full ML lifecycle.
  • Stay abreast of industry trends in machine learning and infrastructure to ensure the adoption of leading-edge technologies and practices.
  • Mentor junior and senior engineers, lead design reviews, and drive cross-team architectural decisions that last.

What This Job Offers

  • Job Type: Full-time
  • Career Level: Principal
  • Industry: Administrative and Support Services
  • Education Level: Bachelor's degree
