Distributed Training Engineer

Periodic Labs
Menlo Park, CA

About The Position

You will optimize, operate, and develop large-scale distributed LLM training systems that power AI scientific research. You will work closely with researchers to bring up, debug, and maintain mid-training and reinforcement learning workflows. You will build tools and directly support frontier-scale experiments to make Periodic Labs the world's best AI + science lab for physicists, computational materials scientists, AI researchers, and engineers. You will also contribute to open-source, large-scale LLM training frameworks.

Requirements

  • Experience with training on clusters with ≥5,000 GPUs.
  • Experience with 5D parallel LLM training.
  • Familiarity with distributed training frameworks such as Megatron-LM, FSDP, DeepSpeed, and TorchTitan (see the illustrative sketch after this list).
  • Ability to optimize training throughput for large-scale Mixture-of-Experts models.
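For context on the kind of framework work involved, below is a minimal, illustrative sketch of sharded data parallelism with PyTorch FSDP, one of the frameworks named above. It is not Periodic Labs' code; the toy model, sizes, and launch assumptions (torchrun with NCCL) are placeholder choices.

```python
# Minimal FSDP sketch (illustrative only, not Periodic Labs' training stack).
# Assumes launch via `torchrun --nproc_per_node=<gpus> train.py` on CUDA GPUs.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy stand-in for a transformer block; a real LLM would be far larger.
    model = nn.Sequential(
        nn.Linear(1024, 4096),
        nn.GELU(),
        nn.Linear(4096, 1024),
    ).cuda()

    # FSDP shards parameters, gradients, and optimizer state across ranks.
    model = FSDP(model)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # One illustrative training step on random data.
    x = torch.randn(8, 1024, device="cuda")
    loss = model(x).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```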

Responsibilities

  • Optimize, operate, and develop large-scale distributed LLM training systems.
  • Work closely with researchers to bring up, debug, and maintain mid-training and reinforcement learning workflows.
  • Build tools to support frontier-scale experiments.
  • Contribute to open-source, large-scale LLM training frameworks.