About The Position

At Rhoda AI, we're building the full-stack foundation for the next generation of humanoid robots, from high-performance, software-defined hardware to the foundation models and video world models that control it. Our robots are designed to be generalists capable of operating in complex, real-world environments and handling scenarios unseen in training. We work at the intersection of large-scale learning, robotics, and systems, with a research team drawn from Stanford, Berkeley, Harvard, and beyond. We're not building a feature; we're building a new computing platform for physical work. With over $400M raised, we're investing aggressively in the R&D, hardware development, and manufacturing scale-up to make that a reality.

We're looking for a Staff / Principal ML Systems Engineer to own training-systems performance end-to-end. You will define how our models train at scale, driving efficiency, scalability, and correctness across large-scale multimodal training. This is a core systems role, not infrastructure support: your work directly determines how efficiently we use compute, how well models scale across thousands of GPUs, and how quickly research can iterate.

Requirements

  • Proven track record improving large-scale distributed training performance
  • Deep hands-on experience with modern ML stacks (PyTorch required; JAX a plus)
  • Strong understanding of data / tensor / pipeline parallelism, sharded training (FSDP / ZeRO-style), communication patterns and overlap strategies, and scaling behavior across large GPU clusters
  • Strong systems intuition — ability to reason across compute, communication, and memory bottlenecks
  • Exceptional debugging and measurement ability: turn "training is slow" into clear bottlenecks, experiments, and validated improvements
  • High ownership mindset and comfort in a fast-moving environment
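
To make the FSDP / ZeRO-style sharding requirement concrete, here is a back-of-the-envelope sketch of per-GPU model-state memory under the different sharding stages. The 2+2+12-byte accounting follows the ZeRO paper's mixed-precision Adam assumption (fp16 parameters and gradients, fp32 optimizer states); the function name and example numbers are illustrative, not from this posting.

```python
def zero_memory_per_gpu(params, world_size, stage):
    """Approximate per-GPU model-state memory (bytes) for mixed-precision
    Adam under ZeRO-style sharding: 2 bytes/param (fp16 weights),
    2 bytes/param (fp16 grads), 12 bytes/param (fp32 master weights + Adam
    moments). Activations and fragmentation are deliberately ignored."""
    p, g, o = 2 * params, 2 * params, 12 * params
    if stage == 1:                      # shard optimizer states only
        return p + g + o / world_size
    if stage == 2:                      # shard optimizer states + gradients
        return p + (g + o) / world_size
    if stage == 3:                      # shard everything (FSDP full sharding)
        return (p + g + o) / world_size
    return p + g + o                    # stage 0: plain data parallelism

# Illustrative: a 1B-parameter model across 8 GPUs.
unsharded = zero_memory_per_gpu(1e9, 8, 0)   # 16 GB of model state per GPU
fully_sharded = zero_memory_per_gpu(1e9, 8, 3)  # 2 GB per GPU
```

Estimates like this are how parallelism strategy choices (stage 2 vs. full sharding vs. tensor parallelism) usually get framed before any profiling happens.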

Nice To Haves

  • GPU kernel or compiler-level experience (CUDA, Triton, graph capture, operator fusion)
  • Experience with multimodal or video training (variable-length sequences, packing/bucketing)
  • Experience working on large-scale training frameworks or distributed runtimes
  • Familiarity with cluster topology, networking, and large-scale scheduling effects
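
As an illustration of the packing/bucketing point above, here is a minimal greedy first-fit-decreasing packer for variable-length sequences. It is a simplified sketch: function names, the bin representation, and the capacity are illustrative assumptions, and a production packer would also balance token counts across data-parallel ranks.

```python
def pack_sequences(lengths, max_tokens):
    """Greedy first-fit-decreasing packing of variable-length sequences
    into fixed-capacity token bins, to reduce padding waste per batch.
    Returns a list of bins, each [used_tokens, [sequence indices]]."""
    bins = []
    # Longest-first ordering tends to pack tighter than arrival order.
    for idx in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        for b in bins:
            if b[0] + lengths[idx] <= max_tokens:
                b[0] += lengths[idx]
                b[1].append(idx)
                break
        else:
            bins.append([lengths[idx], [idx]])
    return bins

# Illustrative: pack five sequences into 1024-token bins.
lengths = [900, 600, 400, 100, 24]
packed = pack_sequences(lengths, max_tokens=1024)
# Padding efficiency = real tokens / (bins * capacity)
efficiency = sum(lengths) / (len(packed) * 1024)
```

Compared to padding each sequence to the batch maximum, packing like this directly reduces wasted FLOPs on padding tokens, which is why it shows up alongside bucketing in multimodal and video training pipelines.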

Responsibilities

  • Own training performance end-to-end
      • Diagnose and improve performance of large-scale multimodal training (vision, video, proprioception, actions, language)
      • Build systematic performance attribution: step-time decomposition (compute vs. communication vs. input pipeline), scaling curves across cluster sizes, and bottleneck identification and prioritization
      • Drive measurable gains in:
          • Distributed efficiency (comm/compute overlap, bucketization, topology-aware mapping, parallelism strategies)
          • Compute efficiency (kernel hotspots, operator fusion, attention optimization, framework/runtime overhead)
          • Memory efficiency (activation checkpointing, sequence packing/bucketing, fragmentation reduction)
  • Design training systems (not just tune them)
      • Define and evolve parallelism strategies: data / tensor / pipeline / sharding / hybrid approaches
      • Improve execution efficiency through communication scheduling and overlap, graph capture and execution optimization, and runtime-level improvements
      • Contribute to and extend training frameworks where needed
  • Make performance observable and measurable
      • Establish source-of-truth performance metrics: step-time breakdowns, MFU / throughput / scaling efficiency
      • Build tools to identify bottlenecks quickly, track performance across model families, and compare scaling behavior across configurations
      • Develop regression detection: microbenchmarks, performance baselines, and automated detection of efficiency regressions
  • Partner deeply with researchers
      • Work side-by-side with research scientists and research engineers; no silos
      • Translate model innovations into scalable, efficient implementations
      • Advise on training tradeoffs for robotics world models: long-horizon sequences, rollout/evaluation cadence, multimodal and variable-length data
  • Collaborate on cluster-level efficiency
      • Work with infrastructure/SRE teams to improve utilization across large distributed jobs, understand the impact of network and collective performance on training, and shape topology-aware job placement and scaling behavior
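
As an illustration of the step-time and MFU bookkeeping described above, here is a back-of-the-envelope sketch. All numbers are illustrative, and the 6 * params * tokens FLOPs estimate is the common transformer-training approximation, not a figure from this role.

```python
def model_flops_utilization(params, tokens_per_step, step_time_s,
                            gpus, peak_flops_per_gpu):
    """Rough MFU estimate: achieved training FLOP/s (using the common
    ~6 * params * tokens approximation) over aggregate peak FLOP/s."""
    achieved = 6 * params * tokens_per_step / step_time_s
    return achieved / (gpus * peak_flops_per_gpu)

def step_time_breakdown(compute_s, exposed_comm_s, input_wait_s):
    """Decompose a training step into compute, exposed (non-overlapped)
    communication, and input-pipeline stall fractions."""
    total = compute_s + exposed_comm_s + input_wait_s
    return {"compute": compute_s / total,
            "communication": exposed_comm_s / total,
            "input_pipeline": input_wait_s / total}

# Illustrative: an 8B-parameter model, 4M tokens/step, 2 s steps,
# 1024 GPUs at ~1e15 peak FLOP/s each.
mfu = model_flops_utilization(8e9, 4e6, 2.0, 1024, 1e15)   # ~0.094
parts = step_time_breakdown(compute_s=1.5, exposed_comm_s=0.3,
                            input_wait_s=0.2)
```

In practice these two views work together: a low MFU alone says nothing actionable, but combined with the breakdown it points at whether the next win is a kernel, an overlap schedule, or the data loader.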