We are seeking a Senior Lead / Lead ML Platform Engineer to architect and own the technical direction for our Training and Inference infrastructure. This is a high-leverage role for an expert who understands the full technical stack required to move ML models from research to global production. You will be responsible for the "engine room" of the AMLG, ensuring that our MLEs can train massive models efficiently and serve them reliably at sub-millisecond latency.

This role requires a rare blend of expertise in distributed systems and hardware acceleration. You will lead the adoption and optimization of Anyscale (Ray) for distributed training and manage a high-performance Kubernetes-based inference environment. You aren't just managing clusters; you are building a seamless, scalable platform that abstracts away the complexity of GPUs and distributed compute for the entire organization.

The ML Platform Lead is the force multiplier for every other ML pod. In this role, you will directly shape:

The Training Foundation: Establishing Anyscale/Ray as the standard for distributed compute, enabling MLEs to train models on petabytes of data without managing infrastructure.

Inference at Scale: Architecting the serving layer that handles billions of requests per day, optimizing for both p99 latency and GPU utilization.

Operational Excellence: Setting the organizational standards for how ML models are deployed, monitored, and scaled across the enterprise.
Job Type
Full-time
Career Level
Senior
Education Level
No Education Listed