ByteDance · Posted 3 months ago
Mid Level
San Jose, CA
5,001-10,000 employees
Publishing Industries

ByteDance, founded in 2012, is on a mission to inspire creativity and enrich life through its diverse suite of products, including TikTok and various platforms tailored for the Chinese market. The company emphasizes the importance of creation, innovation, and teamwork in achieving its goals. The Doubao (Seed) Team, established in 2023, focuses on pioneering advanced AI foundation models, with research areas spanning deep learning, reinforcement learning, and AI safety. The Machine Learning (ML) System sub-team is dedicated to developing and maintaining distributed ML training and inference systems globally, ensuring high performance and reliability.

Responsibilities
  • Ensure ML systems operate efficiently for large model development, training, evaluation, and inference.
  • Maintain the stability of offline tasks and services across multi-data-center, multi-region, and multi-cloud scenarios.
  • Plan and manage computing and storage resources, and oversee cost and budget.
  • Oversee global system disaster recovery, cluster machine governance, and stability of business services.
  • Improve resource utilization and operational efficiency.
  • Build software tools, products, and systems for efficient monitoring and management of ML infrastructure and services.
  • Provide system and business on-call support as part of the global team.

Requirements
  • Strong experience in system engineering and machine learning.
  • Proficiency in developing and maintaining distributed systems.
  • Experience with resource management and planning in cloud environments.
  • Knowledge of disaster recovery and business service stability.
  • Ability to build software tools for infrastructure management.

Nice-to-haves
  • Experience with GPU, NPU, RDMA, and storage systems.
  • Familiarity with large-scale heterogeneous systems.
  • Background in AI safety and multimodal capabilities.

Benefits
  • Opportunity to work with cutting-edge AI technologies.
  • Collaborative global team environment.
  • Access to substantial data and computing resources.