Member of Technical Staff - RL Infrastructure

Vmax • San Francisco, CA

12d • $300,000 - $500,000 • Hybrid

About The Position

This role is for strong infrastructure engineers who can build the systems layer for RL at scale: distributed rollouts, training orchestration, inference, evals, data pipelines, observability, and reliability. You will create the durable platform that enables researchers and applied ML engineers to run, debug, and reproduce large-scale RL experiments.

Requirements

Strong software engineering experience.
Experience building infrastructure for LLM inference and/or RL training.
Experience with GPU clusters, distributed training, model serving, or high-throughput inference systems.
Familiarity with vLLM, SGLang, and modern LLM-RL training frameworks.
Strong understanding of system reliability, observability, testing, debugging, and performance optimization.
Ability to work closely with ML researchers and translate messy experimental workflows into durable infrastructure.
Experience building tools, platforms, or services used by other technical users.
Strong judgment around technical tradeoffs: when to prototype, when to harden, when to simplify, and when to redesign.
Clear written and verbal communication, especially around system design, operational risks, and engineering tradeoffs.

Nice To Haves

Experience supporting research teams or fast-moving ML teams.
Experience at an organization with a high engineering bar, where reliability, ownership, and code quality were central.
Evidence of strong independent technical work, such as open-source projects, infrastructure projects, competitions, or substantial systems built from scratch.
Experience reducing operational complexity in systems that had become brittle, slow, or hard to debug.

Responsibilities

Build infrastructure for distributed RL training and inference across thousands of GPUs.
Improve the reliability, debuggability, and throughput of RL experiments.
Build interfaces that allow researchers and applied ML engineers to launch, inspect, compare, and reproduce experiments easily.
Own infrastructure projects end to end, from architecture and implementation through deployment, documentation, and long-term maintenance.
Identify and eliminate bottlenecks in training, rollout generation, eval execution, data movement, and cluster utilization.
Maintain engineering standards for RL infrastructure, including testing, observability, versioning, and reproducibility.
