Research Engineer - Training Platform

Rhoda AI • Palo Alto, CA

10d

About The Position

At Rhoda AI, we're building the next generation of generalist intelligent robots. We own the full robotics stack, from high-performance hardware and robot systems to the infrastructure and state-of-the-art foundation world models that control our robots. Our robots are designed to be generalists capable of operating in complex, real-world environments and handling long-tail edge cases, made possible by our cutting-edge research and end-to-end system design. We've raised over $400M and are investing aggressively in model research, infrastructure, hardware development, and manufacturing scale-up to make generalist robotics a reality.

We're looking for a Research Engineer to build and maintain the training platform that powers our model development: experiment orchestration, job management, observability, and the tooling that lets researchers move from idea to result as fast as possible.

Requirements

Strong software engineering skills with experience in MLOps or ML platform engineering
Familiarity with distributed training frameworks (PyTorch DDP, FSDP, DeepSpeed, Megatron, or similar)
Experience building experiment tracking, reproducibility, and artifact management systems
Comfortable managing and operating GPU cluster environments (Slurm, Kubernetes, or similar)
Strong reliability engineering instincts: monitoring, alerting, and failure recovery

Nice To Haves

Experience with training orchestration tools (Slurm, Ray, Kubernetes, or similar schedulers)
Familiarity with experiment tracking tools (Weights & Biases, MLflow, or custom solutions)
Experience supporting large model training pipelines (LLMs, VLMs, or video models)
Understanding of parallelism strategies and how they affect training efficiency and debugging
Experience with cloud-based training infrastructure (AWS, GCP, or Azure)

Responsibilities

Build and maintain training orchestration systems for large-scale distributed model training across GPU clusters
Develop experiment management tooling: job configuration, tracking, reproducibility, and artifact management
Build observability infrastructure for training runs: loss curves, compute utilization, gradient statistics, and anomaly detection
Optimize and automate the research iteration loop from experiment launch to results analysis
Manage job scheduling and cluster utilization for efficient use of GPU compute
Build internal tooling and interfaces that help researchers move faster
Collaborate with training systems, data infrastructure, and research teams to support their platform needs

Benefits

Your platform is the daily tool every researcher and engineer uses to train models
Improvements to training velocity and reliability compound across every experiment the team runs
High visibility with direct feedback from researchers and ML engineers
Build systems that scale from today's models to future frontier training runs
