About The Position

ByteDance, founded in 2012, is on a mission to inspire creativity and enrich life through its diverse suite of products, including TikTok and various platforms tailored for the Chinese market. The company emphasizes the importance of creation, innovation, and teamwork in achieving its goals.

The Doubao (Seed) Team, established in 2023, focuses on pioneering advanced AI foundation models, with research areas spanning deep learning, reinforcement learning, language, vision, audio, AI infrastructure, and AI safety. The team operates globally, leveraging substantial data and computing resources to develop proprietary models that power numerous ByteDance applications and are available to external clients.

The Machine Learning (ML) System sub-team is dedicated to developing and maintaining distributed ML training and inference systems, ensuring high performance and reliability across various platforms.

Requirements

  • Strong experience in system engineering and machine learning.
  • Proficiency in developing and maintaining distributed systems.
  • Experience with resource management and planning in a multi-cloud environment.
  • Knowledge of disaster recovery and business service stability.
  • Ability to build monitoring and management tools for ML infrastructure.

Nice To Haves

  • Experience with GPU, NPU, RDMA, and storage systems.
  • Familiarity with large-scale heterogeneous systems.
  • Background in AI safety and infrastructure.

Responsibilities

  • Ensure ML systems operate efficiently for large model development, training, evaluation, and inference.
  • Maintain the stability of offline tasks and services across multi-data-center, multi-region, and multi-cloud scenarios.
  • Manage resources and planning, including computing and storage resources, cost, and budget.
  • Oversee global system disaster recovery, cluster machine governance, and stability of business services.
  • Improve resource utilization and operational efficiency.
  • Build software tools, products, and systems to monitor and manage ML infrastructure and services efficiently.
  • Provide system and business on-call support as part of the global team.

Benefits

  • Opportunity to work with cutting-edge AI technologies.
  • Collaborative global team environment.
  • Access to substantial data and computing resources.