Member of Technical Staff: Pre-Training Infrastructure

Recruiting From Scratch
San Francisco, CA
Onsite

About The Position

We're representing a well-funded robotics and AI startup building autonomous systems for industrial environments. The company is developing a vertically integrated robotics platform that combines advanced machine learning, robotics infrastructure, and large-scale model training to solve some of the most challenging problems in physical automation. As one of the earliest members of the pre-training organization, you'll play a critical role in building the infrastructure that powers large-scale foundation model training. This team sits at the intersection of distributed systems, machine learning infrastructure, and hardware optimization, enabling researchers to train and iterate on increasingly sophisticated multimodal AI systems.

Requirements

  • 1–5 years of experience building infrastructure for large-scale machine learning training.
  • Direct experience owning or operating pre-training infrastructure for foundation models.
  • Experience managing distributed training systems across multi-node environments and clusters of 100+ GPUs.
  • Deep understanding of training bottlenecks related to compute, memory, networking, storage, and data loading.
  • Extensive experience with PyTorch and/or JAX in production or research environments.
  • Strong systems engineering skills spanning machine learning infrastructure, distributed systems, and hardware optimization.
  • Proven ability to troubleshoot issues across the full stack, including model code, data pipelines, infrastructure, and hardware.
  • Experience working in highly autonomous environments with significant ownership and responsibility.

Nice To Haves

  • Experience building infrastructure for multimodal, video, robotics, or foundation model training.
  • Background supporting large-scale pre-training, post-training, RLHF, preference learning, or synthetic data workflows.
  • Experience with autonomous systems, robotics, autonomous vehicles, or large-scale perception systems.
  • Familiarity with data quality systems, dataset auditing, deduplication, and evaluation contamination detection.
  • Strong academic background in Computer Science, Machine Learning, Robotics, or a related field.
  • Publications at top-tier machine learning conferences such as NeurIPS, ICML, or ICLR.
  • Experience as an early startup employee or sole owner of critical machine learning infrastructure systems.

Responsibilities

  • Design and maintain distributed training infrastructure for large-scale foundation model development.
  • Build efficient and reproducible multi-GPU and multi-node training workflows.
  • Develop high-performance data pipelines capable of handling multimodal datasets, including video and large-scale structured data.
  • Optimize GPU utilization, training throughput, and hardware efficiency across large compute clusters.
  • Implement systems for checkpointing, experiment tracking, evaluation, reproducibility, and model comparison.
  • Build scalable data loading, sharding, and preprocessing infrastructure to support rapidly growing datasets.
  • Debug and resolve issues across model code, infrastructure, networking, storage, and hardware layers.
  • Partner closely with research teams to accelerate experimentation and improve model training velocity.
  • Establish reliable training baselines and infrastructure standards that support future model development.
  • Help define the company’s long-term training infrastructure strategy as one of the earliest hires in the function.

Benefits

  • Base salary: $200,000–$350,000.
  • Equity: 0.25–0.40%.
  • Visa sponsorship available for select visa categories.
  • Opportunity to join as one of the earliest members of a highly technical machine learning infrastructure team.
  • Direct ownership of foundational systems that influence research velocity and model performance.
  • Exposure to cutting-edge robotics, multimodal AI, and large-scale foundation model development.
  • High-impact role with significant autonomy and technical ownership.
  • Collaborative onsite culture focused on speed, execution, and ambitious technical goals.
  • Opportunity to work alongside engineers and researchers from leading AI, infrastructure, and robotics organizations.