Research Scientist - Machine Learning System

ByteDance • San Jose, CA

About The Position

AML-MLsys combines systems engineering and the art of machine learning to develop and maintain massively distributed ML training and inference systems and services around the world, providing high-performance, highly reliable, scalable systems for LLM/AIGC/AGI. On our team, you'll have the opportunity to build large-scale heterogeneous systems that integrate GPU/NPU/RDMA/storage and keep them running stably and reliably, deepen your expertise in coding, performance analysis, and distributed systems, and take part in the decision-making process. You'll also be part of a global team with members from the United States, China, and Singapore working collaboratively toward a unified project direction.

Responsibilities

Develop and optimize LLM training, inference, and RL frameworks.
Work closely with model researchers to scale LLM training and RL to the next level.
Optimize GPU and CUDA performance to build an industry-leading, high-performance LLM training, inference, and RL engine.

What This Job Offers

Industry

Publishing Industries

Number of Employees

5,001-10,000 employees
