About The Position

As a staff engineer on the ML Compute team, you will lead the development of the infrastructure to run large-scale workloads in the cloud, such as Apache Spark, Ray, and distributed training. The full scope of the role is listed under Responsibilities.

Requirements

  • Bachelor's degree in Computer Science, Engineering, or a related field.
  • 6+ years of hands-on experience in building scalable backend systems for training and evaluation of machine learning models.
  • Proficiency in relevant programming languages, such as Python or Go.
  • Strong expertise in distributed systems, reliability and scalability, containerization, and cloud platforms.
  • Proficiency with cloud computing infrastructure and tools such as Kubernetes, Ray, and PySpark.
  • Ability to clearly and concisely communicate technical and architectural problems, while working with partners to iteratively find solutions.

Nice To Haves

  • Advanced degree in Computer Science, Engineering, or a related field.
  • Hands-on experience with cloud-native resource management and scheduling tools like Apache YuniKorn.
  • Experience with advanced architecture for distributed data processing and ML workloads.
  • Proficiency in working with and debugging accelerators such as GPUs, TPUs, and AWS Trainium.

Responsibilities

  • Lead the development of the infrastructure to run large-scale workloads in the cloud, such as Apache Spark, Ray, and distributed training.
  • Optimize platform efficiency and throughput by improving resource management capabilities with schedulers like Apache YuniKorn and Kueue.
  • Integrate new features from core distributed computing and ML frameworks into the platform, offering them to production users and providing support.
  • Enhance the platform's scalability, performance, and observability through improved monitoring and logging.
  • Drive the architectural evolution of the platform by adopting modern, cloud-native technologies to improve system performance, efficiency, and scalability.
  • Reduce DevOps effort by automating and streamlining operational processes.
  • Mentor engineers in areas of your expertise, fostering skill growth and knowledge sharing.
© 2024 Teal Labs, Inc