Member of Technical Staff — RL Research

Nuance Labs, Seattle, WA
$300,000 - $400,000, Onsite

About The Position

Nuance Labs is seeking a deeply technical Member of Technical Staff to own Reinforcement Learning (RL) and post-training for large-scale omni models. The role requires understanding modern post-training methods, building the infrastructure to run them at scale, and contributing across RL method development, rollout generation, reward modeling, policy optimization, evaluation, data feedback loops, serving, observability, and distributed execution. The successful candidate will build Nuance’s RL/post-training stack from the ground up, scale it significantly, and translate research ideas into reliable training systems.

The work extends beyond text to audio, video, language, and real-time full-duplex interaction, with a focus on improving interactive behavior, timing, interruption, emotional response, audiovisual coherence, and real-time conversational quality. This is a high-ownership role with direct impact on how models improve after pretraining.

Requirements

  • Hands-on experience with RL, RLHF, RLAIF, post-training, alignment, or large-scale fine-tuning for modern foundation models.
  • Strong understanding of RL/post-training methods: policy optimization, reward modeling, preference optimization, rejection sampling, KL control, evaluation, and data feedback loops.
  • Ability to reason about model behavior and training dynamics: reward hacking, unstable rewards, distribution shift, stale policies, mode collapse, over-optimization, noisy preferences, and evaluation mismatch.
  • Practical experience building or operating RL/post-training pipelines with frameworks such as verl, ms-swift, OpenRLHF, or equivalent internal systems, including integration with rollout serving systems such as vLLM.
  • Experience with large-scale training or inference systems, including rollout generation, model serving, batching, queueing, GPU utilization, checkpointing, and debugging.
  • Understanding of omni post-training for real-time audio-video-language interaction: temporal alignment, interruption, emotional response, and multimodal evaluation.
  • Strong software engineering fundamentals, curiosity, and adaptability to new RL algorithms, model architectures, serving systems, evaluation methods, and research ideas.

Nice To Haves

  • Prior 0→1 experience building post-training systems, RL pipelines, agent training systems, evaluation platforms, or large-scale model improvement loops.
  • Experience with PPO, GRPO, DPO, online RL, RLHF/RLAIF, reward modeling, preference data, synthetic data generation, or model-based data improvement.
  • Experience with omni or multimodal post-training for audio-video-language models, especially long-context or real-time interactive systems.
  • Experience scaling mixed training/inference workloads across large GPU clusters.
  • Experience with adjacent areas such as distributed pretraining, data infrastructure, inference serving, simulation, human/AI feedback collection, or evaluation infrastructure.
  • Publications or substantial open-source contributions in RL, post-training, alignment, evaluation, ML systems, or model behavior.

Responsibilities

  • Build Nuance’s RL/post-training stack from 0→1: rollout generation, policy optimization, reward/reference model serving, data feedback loops, evaluation, checkpointing, observability, and debugging.
  • Develop and scale post-training methods such as PPO, GRPO, DPO, rejection sampling, RLHF/RLAIF, online RL, and model-based data improvement.
  • Design the systems abstractions that connect research ideas to production-scale RL runs: trainers, rollout workers, reward models, evaluators, data queues, experience buffers, and checkpoint promotion.
  • Build evaluation and feedback loops for omni behavior: turn-taking, interruption, timing, emotional response, audiovisual coherence, instruction following, and real-time interaction quality.
  • Optimize the end-to-end post-training loop across rollout throughput, serving latency, GPU utilization, policy update efficiency, queueing, checkpoint overhead, and research iteration speed.
  • Evolve the platform as algorithms, model architectures, reward definitions, data sources, and evaluation methods change.

Benefits

  • HSA plan with ~$2,000 in company contributions
  • 15 days PTO + public holidays
  • Company closes for a full week over the holidays
  • Lunch, beverages, and snacks provided daily
  • Commuter benefits
  • 401(k) (in the works)

© 2026 Teal Labs, Inc.