Engineering Manager - ML Platform and Infrastructure

Applied Intuition · Sunnyvale, CA
Onsite

About The Position

As an Engineering Manager on the ML Platform team, you'll lead a world-class group of engineers building the infrastructure that powers Physical AI at scale. Your team will own three critical areas:

  • Training & Inference Orchestration: building frameworks to efficiently schedule and run massive jobs across thousands of GPUs
  • GPU Cluster Architecture: designing and scaling what will be the largest GPU cluster for Physical AI in the industry
  • Performance Optimization: pushing the limits of hardware utilization, throughput, and cost efficiency for large-scale training and inference workloads

You'll work at the intersection of systems engineering and ML, partnering directly with stack development and research teams to remove bottlenecks and accelerate the path from experimentation to production.

Requirements

  • 3+ years of engineering management experience, ideally leading infrastructure or platform teams
  • Passion for building and leading high-performing teams that operate at the frontier of scale
  • Deep experience with distributed systems, GPU computing, or large-scale ML infrastructure
  • Direct experience building or operating large GPU clusters (1,000+ GPUs)
  • Strong understanding of distributed training frameworks (e.g., PyTorch Distributed, Megatron-LM, DeepSpeed, FSDP) and job orchestration at scale
  • Familiarity with GPU cluster management, high-performance networking (InfiniBand, RDMA), and resource scheduling (Slurm, Kubernetes)
  • Track record of building and operating systems that run reliably at massive scale

Nice To Haves

  • Background in training optimization techniques such as mixed-precision training, pipeline/tensor/data parallelism, or checkpointing strategies
  • Experience with inference optimization (batching, model serving, quantization, compiler-level optimizations)
  • Familiarity with Physical AI domains such as autonomous driving, robotics, or simulation
  • Contributions to open-source ML infrastructure projects

Responsibilities

  • Grow and manage a team of world-class infrastructure and systems engineers with the goal of delivering a best-in-class ML platform for Physical AI
  • Own the design and evolution of frameworks for orchestrating distributed training and inference jobs across thousands of GPUs
  • Drive the buildout and scaling of our GPU cluster infrastructure, making critical decisions on architecture, scheduling, networking, and resource management
  • Lead efforts to optimize training and inference performance — including throughput, fault tolerance, GPU utilization, and cost efficiency at scale
  • Set team goals and roadmap in alignment with research milestones, model development timelines, and production deployment requirements
  • Partner closely with research, stack development, and infrastructure teams to understand their workflows and accelerate their iteration speed
  • Drive hiring, mentoring, and growth for a high-performing, mission-driven team

Benefits

  • Equity in the form of options and/or restricted stock units
  • Comprehensive health, dental, vision, life, and disability insurance coverage
  • 401(k) retirement benefits with employer match
  • Learning and wellness stipends
  • Paid time off