Senior/Staff Machine Learning Engineer, Training Runtime Performance

Nuro · Mountain View, CA
$235,030 - $352,290 · Onsite

About The Position

We are seeking a highly experienced Staff Software Engineer to join our ML Infrastructure team, focusing on optimizing training runtime efficiency and input pipelines for model training, evaluation, and distillation workloads. In this role, you will enable models to train faster and more efficiently, accelerating our self-driving roadmap for commercial and personal mobility.

Requirements

  • B.S./M.S./Ph.D. in Computer Science, Electrical Engineering, or related technical field (or equivalent experience).
  • 4+ years of professional experience in ML infrastructure, distributed training, or ML systems engineering, scaling models on multi-node, multi-accelerator clusters.
  • Understanding of training, evaluation, and distillation workflows for billion-parameter models.
  • Expert-level knowledge of distributed systems and Python.
  • Strong skills in profiling, debugging, and optimizing quantized workloads.
  • Experience with ML compilers and strategies to reduce startup overhead.
  • Familiarity with model distillation and efficient inference workflows.

Nice To Haves

  • Previous contributions to open source ML infra projects or research publications in ML systems.
  • Hands-on experience with Foundation Model infrastructure.
  • High proficiency in C++, distributed systems, and ML framework internals (e.g., NCCL, Horovod, DeepSpeed, Ray).

Responsibilities

  • Collaborate with ML practitioners and other infrastructure teams to understand their needs and integrate optimized input pipelines seamlessly into their workflows.
  • Detect, diagnose, and resolve performance bottlenecks across training, eval, and model distillation workflows.
  • Optimize training performance and resource utilization, and ensure consistent, reproducible model training outcomes.
  • Optimize input data pipelines to increase runtime goodput, ensuring accelerators maximize their "time on task" and minimize idle cycles.
  • Champion best practices for robust, reproducible, and debuggable ML experimentation.

Benefits

  • Annual performance bonus
  • Equity
  • Competitive benefits package