Member of Technical Staff - RL Infrastructure

Vmax
San Francisco, CA
$300,000 - $500,000
Hybrid

About The Position

This role is for strong infrastructure engineers who can build the systems layer for RL at scale: distributed rollouts, training orchestration, inference, evals, data pipelines, observability, and reliability. You will create the durable platform that enables researchers and applied ML engineers to run, debug, and reproduce large-scale RL experiments.

Requirements

  • Strong software engineering experience.
  • Experience building infrastructure for LLM inference and/or RL training.
  • Experience with GPU clusters, distributed training, model serving, or high-throughput inference systems.
  • Familiarity with vLLM, SGLang, and modern LLM-RL training frameworks.
  • Strong understanding of system reliability, observability, testing, debugging, and performance optimization.
  • Ability to work closely with ML researchers and translate messy experimental workflows into durable infrastructure.
  • Experience building tools, platforms, or services used by other technical users.
  • Strong judgment around technical tradeoffs: when to prototype, when to harden, when to simplify, and when to redesign.
  • Clear written and verbal communication, especially around system design, operational risks, and engineering tradeoffs.

Nice To Haves

  • Experience supporting research teams or fast-moving ML teams.
  • Experience at an organization with a high engineering bar, where reliability, ownership, and code quality were central.
  • Evidence of strong independent technical work, such as open-source projects, infrastructure projects, competitions, or substantial systems built from scratch.
  • Experience reducing operational complexity in systems that had become brittle, slow, or hard to debug.

Responsibilities

  • Build infrastructure for distributed RL training and inference across thousands of GPUs.
  • Improve the reliability, debuggability, and throughput of RL experiments.
  • Build interfaces that allow researchers and applied ML engineers to launch, inspect, compare, and reproduce experiments easily.
  • Own infrastructure projects end to end, from architecture and implementation through deployment, documentation, and long-term maintenance.
  • Identify and eliminate bottlenecks in training, rollout generation, eval execution, data movement, and cluster utilization.
  • Maintain engineering standards for RL infrastructure, including testing, observability, versioning, and reproducibility.