About The Position

AML-MLsys combines systems engineering and the art of machine learning to develop and maintain massively distributed ML training and inference systems and services around the world, providing high-performance, highly reliable, scalable systems for LLM/AIGC/AGI. On our team, you'll have the opportunity to build large-scale heterogeneous systems that integrate GPU/NPU/RDMA/storage and keep them running stably and reliably, deepen your expertise in coding, performance analysis, and distributed systems, and take part in the decision-making process. You'll also be part of a global team with members from the United States, China, and Singapore working collaboratively toward a unified project direction.

Responsibilities

  • Develop and optimize LLM training, inference, and RL frameworks.
  • Work closely with model researchers to scale LLM training and RL to the next level.
  • Optimize GPU and CUDA performance to create an industry-leading, high-performance engine for LLM training, inference, and RL.