Software Engineer: ML Infra

Generalist
San Francisco, CA

About The Position

Generalist trains very large robot foundation models. This requires running distributed training jobs and researcher experiments across very large fleets of the latest-generation GPU hardware (currently NVIDIA). Our storage and data-loading requirements are extreme, demanding that we push cloud infrastructure to its limits and build custom solutions where it falls short. You will also own our inference infrastructure: for our robots, this is a fleet of on-prem GPUs attached to robots operating under tight real-time latency budgets in compute-constrained environments.

Requirements

  • Experience managing large GPU fleets running large-scale, long-running, highly distributed training or inference jobs
  • Deep experience with Slurm or Kubernetes for ML workload orchestration
  • Experience building high-scale ML data loaders and data preparation systems
  • Deep understanding of every layer of the ML hardware, storage, and networking stacks
  • Experience with the NVIDIA GPU ecosystem

Responsibilities

  • Own our GPU compute fleets
  • Ensure our GPUs are easy for researchers to use and maximally utilized
  • Optimize and improve ML data loading, transport, and storage in highly distributed, fully utilized environments
  • Orchestrate our robot inference fleets
© 2026 Teal Labs, Inc