As a staff engineer on the ML Compute team, your work will include:
- Lead the development of the infrastructure that runs large-scale workloads in the cloud, such as Apache Spark, Ray, and distributed training.
- Optimize platform efficiency and throughput by improving resource management capabilities with schedulers like Apache YuniKorn and Kueue.
- Integrate new features from core distributed computing and ML frameworks into the platform, offer them to production users, and provide support.
- Enhance the platform's scalability, performance, and observability through improved monitoring and logging.
- Drive the architectural evolution of the platform by adopting modern, cloud-native technologies to improve system performance, efficiency, and scalability.
- Reduce DevOps effort by automating and streamlining operational processes.
- Mentor engineers in your areas of expertise, fostering skill growth and knowledge sharing.
Education Level
Bachelor's degree
Number of Employees
5,001-10,000 employees