Distributed Training Engineer

Periodic Labs
Menlo Park, CA

About The Position

You will optimize, operate, and develop large-scale distributed LLM training systems that power AI scientific research. You will work closely with researchers to bring up, debug, and maintain mid-training and reinforcement learning workflows. You will build tools and directly support frontier-scale experiments to make Periodic Labs the world's best AI + science lab for physicists, computational materials scientists, AI researchers, and engineers. You will also contribute to open-source, large-scale LLM training frameworks.

Requirements

  • Experience with training on clusters with ≥5,000 GPUs.
  • Experience with 5D parallel LLM training.
  • Familiarity with distributed training frameworks such as Megatron-LM, FSDP, DeepSpeed, and TorchTitan (see the illustrative sketch after this list).
  • Ability to optimize training throughput for large-scale Mixture-of-Experts models.
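For context on the kind of framework work involved, below is a minimal, illustrative sketch of sharded data parallelism with PyTorch FSDP, one of the frameworks named above. It is not Periodic Labs' code; the toy model, sizes, and launch assumptions (torchrun with NCCL) are placeholder choices.

```python
# Minimal FSDP sketch (illustrative only, not Periodic Labs' training stack).
# Assumes launch via `torchrun --nproc_per_node=<gpus> train.py` on CUDA GPUs.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy stand-in for a transformer block; a real LLM would be far larger.
    model = nn.Sequential(
        nn.Linear(1024, 4096),
        nn.GELU(),
        nn.Linear(4096, 1024),
    ).cuda()

    # FSDP shards parameters, gradients, and optimizer state across ranks.
    model = FSDP(model)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # One illustrative training step on random data.
    x = torch.randn(8, 1024, device="cuda")
    loss = model(x).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```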

Responsibilities

  • Optimize, operate, and develop large-scale distributed LLM training systems.
  • Work closely with researchers to bring up, debug, and maintain mid-training and reinforcement learning workflows.
  • Build tools to support frontier-scale experiments.
  • Contribute to open-source, large-scale LLM training frameworks.