About The Position

Within the Seed-Infra-Training team, this sub-team is responsible for ByteDance's large-model training platform. We support ByteDance's foundation model training and generative AI businesses internally, covering pre-training and post-training of language models, multimodal understanding, video generation, and more. We have built a multi-tenant, multi-cloud heterogeneous GPU computing platform for our users, providing a suite of stable, efficient, observable, and diagnosable framework and platform components that help scale large-model training to the 10,000-GPU level and beyond.

Requirements

  • Currently enrolled in a BS/MS program; solid understanding of distributed and parallel computing principles and familiarity with recent advances in computing, storage, networking, and hardware technologies
  • Familiarity with orchestration frameworks such as Kubernetes, Kubeflow, or Volcano
  • Proficient in at least one major deep learning framework (e.g., PyTorch, Megatron, DeepSpeed, vLLM)

Nice To Haves

  • Knowledge of fault tolerance and system reliability
  • Experience with large-scale training and LLM systems
  • Background in AIOps and resource scheduling
  • Publications at top systems conferences such as OSDI, SOSP, NSDI, ATC, EuroSys, or SysML