Machine Learning Infra Engineer

Reducto
San Francisco, CA
Onsite

About The Position

As an ML Infra Engineer, you’ll play a key role in building the inference and training frameworks that make it possible to deliver results at scale. You’ll collaborate closely with our ML and Platform teams to scale training across nodes, develop faster and more efficient serving, and create observability across the stack. This is a high-impact role where you’ll help define what high-performance ML training and inference look like at Reducto.

Requirements

  • Hold yourself to a high bar for quality and precision.
  • Enjoy solving complex problems and building from first principles.
  • Have strong Python skills and a background in systems engineering.
  • Are comfortable with Kubernetes and distributed training frameworks.
  • Love getting your hands dirty with real-world implementation challenges.
  • Operate well in fast-changing, high-growth environments.
  • Collaborate effectively across technical and non-technical teams.
  • Take full ownership from strategy through execution.

Nice To Haves

  • Have experience at an early-stage or high-growth startup.
  • Have contributed meaningfully to open-source training/inference stacks.
  • Are excited to set up distributed inference across hundreds to thousands of GPUs.
  • Care deeply about combining technical excellence with business impact.

Responsibilities

  • Build and maintain our training and inference stack, with an emphasis on fast iteration and flexibility for exploring new methods in training, and on high performance in inference.
  • Develop benchmarks for both stacks to identify bottlenecks.
  • Explore SOTA advances in training and inference and work to apply them.
  • Design systems for scaling model training across multi-node, multi-GPU environments with strong reliability and observability.
  • Scale distributed training and inference workloads across large GPU clusters while improving utilization, reliability, and cost efficiency.
  • Build the tooling, abstractions, and observability that help ML engineers move faster from experiment to production.

Benefits

  • Unlimited PTO
  • Free lunch daily at the office
  • Reimbursed Transportation
  • Generous health insurance covering medical, dental, and vision.
  • Health and Wellness Budget: up to $150/mo reimbursement for health and wellness spending, such as gym memberships, fitness classes, or similar.
  • Parental Leave