About The Position

Design, build, and operate Reflection’s large-scale GPU infrastructure powering pre-training, post-training, and inference. Develop reliable, high-performance systems for scheduling, orchestration, and observability across thousands of GPUs. Optimize cluster utilization, throughput, and cost efficiency while maintaining reliability at scale. Build tools and automation for distributed training, inference, monitoring, and experiment management. Collaborate closely with research, training, and platform teams to accelerate development and enable large-scale training and inference. Push the limits of hardware, networking, and software to accelerate the path from idea to model.

Requirements

  • Deep systems or infrastructure engineering experience in high-performance or distributed computing environments.
  • Strong understanding of GPUs, CUDA, NCCL, and large-scale training frameworks (PyTorch, DeepSpeed, JAX, etc.).
  • Hands-on experience with containerization, orchestration, and cluster management (Kubernetes, Slurm, etc.).
  • Familiarity with modern observability stacks and performance profiling tools.
  • Ability to thrive in a fast-paced, high-ownership startup environment.

Nice To Haves

  • Excitement about building from zero to one and defining frontier-scale training and RL infrastructure.
  • Motivation to enable researchers and engineers to build open-weight AI systems.

Responsibilities

  • Design, build, and operate large-scale GPU infrastructure.
  • Develop systems for scheduling, orchestration, and observability across GPUs.
  • Optimize cluster utilization, throughput, and cost efficiency.
  • Build tools and automation for distributed training and inference.
  • Collaborate with research, training, and platform teams.
  • Push the limits of hardware, networking, and software.

Benefits

  • Top-tier compensation: Salary and equity structured to recognize and retain the best talent globally.
  • Comprehensive medical, dental, vision, life, and disability insurance.
  • Fully paid parental leave for all new parents, including adoptive and surrogate journeys.
  • Financial support for family planning.
  • Paid time off when needed, wellness and time-saver stipend, commute benefits, education stipend, and relocation support.
  • Lunch and dinner provided daily, plus regular off-sites and team celebrations.