About The Position

On the AI Infra Team, you'll work on the robust, scalable infrastructure that powers our cutting-edge artificial intelligence (AI) and machine learning (ML) initiatives. You will work closely with our AI/ML researchers, data scientists, and software engineers to create an efficient, high-performance environment for training, inference, and data processing. Your expertise will be critical in enabling the next generation of AI-driven products and services.

We are looking for talented individuals to join us for an internship in 2026. PhD internships at ByteDance give students the opportunity to actively contribute to our products and research, and to the organization's future plans and emerging technologies. The program blends hands-on learning, a vibrant mix of social events and development workshops, and collaboration with industry experts, so you can apply your knowledge in real-world scenarios while laying a strong foundation for personal and professional growth.

Applications will be reviewed on a rolling basis; we encourage you to apply early. Please state your availability clearly in your resume (start date, end date).

Responsibilities

  • Lead end-to-end design of scalable, reliable AI infrastructure (AI accelerators, compute clusters, storage, networking) for training and serving large ML workloads.
  • Define and implement service-oriented, containerized architectures (Kubernetes, VM frameworks, unikernels) optimized for ML performance and security.
  • Profile and optimize every layer of the ML stack: ML compilers, GPU/TPU scheduling, NCCL/RDMA networking, data preprocessing, and training/inference frameworks.
  • Develop low-overhead telemetry and benchmarking frameworks to identify and eliminate bottlenecks in distributed training and serving.
  • Build and operate large-scale deployment and orchestration systems that auto-scale across multiple data centers (on-premises and cloud).
  • Champion fault-tolerance, high availability, and cost-efficiency through smart resource management and workload placement.
  • Architect and implement robust ETL and data ingestion pipelines (Spark/Beam/Dask/Flume) tailored for petabyte-scale ML datasets.
  • Integrate experiment management and workflow orchestration tools (Airflow, Kubeflow, Metaflow) to streamline research-to-production.
  • Partner with ML researchers to translate prototype requirements into production-grade systems.
  • Mentor and coach engineers on best practices in performance tuning, systems design, and reliability engineering.


What This Job Offers

  • Career level: Intern
  • Industry: Publishing Industries
  • Education level: Ph.D. or professional degree
  • Number of employees: 5,001-10,000

© 2024 Teal Labs, Inc