Member of Technical Staff — RL Research

Nuance Labs • Seattle, WA

$300,000 - $400,000 • Onsite

About The Position

Nuance Labs is seeking a deeply technical Member of Technical Staff to own reinforcement learning (RL) and post-training for large-scale omni models. The role requires understanding modern post-training methods, building the infrastructure to run them at scale, and contributing across RL method development, rollout generation, reward modeling, policy optimization, evaluation, data feedback loops, serving, observability, and distributed execution. The successful candidate will build Nuance’s RL/post-training stack from the ground up and scale it to production-size RL runs, turning research ideas into reliable training systems. The work extends beyond text to audio, video, language, and real-time full-duplex interaction, with a focus on improving interactive behavior, timing, interruption, emotional response, audiovisual coherence, and real-time conversational quality. This is a high-ownership role with direct impact on how the models improve after pretraining.

Requirements

Hands-on experience with RL, RLHF, RLAIF, post-training, alignment, or large-scale fine-tuning for modern foundation models.
Strong understanding of RL/post-training methods: policy optimization, reward modeling, preference optimization, rejection sampling, KL control, evaluation, and data feedback loops.
Ability to reason about model behavior and training dynamics: reward hacking, unstable rewards, distribution shift, stale policies, mode collapse, over-optimization, noisy preferences, and evaluation mismatch.
Practical experience building or operating RL/post-training pipelines with frameworks such as verl, ms-swift, OpenRLHF, or equivalent internal systems, including integration with rollout serving systems such as vLLM.
Experience with large-scale training or inference systems, including rollout generation, model serving, batching, queueing, GPU utilization, checkpointing, and debugging.
Understanding of omni post-training for real-time audio-video-language interaction: temporal alignment, interruption, emotional response, and multimodal evaluation.
Strong software engineering fundamentals, plus the curiosity and adaptability to pick up new RL algorithms, model architectures, serving systems, evaluation methods, and research ideas.

Nice To Haves

Prior 0→1 experience building post-training systems, RL pipelines, agent training systems, evaluation platforms, or large-scale model improvement loops.
Experience with PPO, GRPO, DPO, online RL, RLHF/RLAIF, reward modeling, preference data, synthetic data generation, or model-based data improvement.
Experience with omni or multimodal post-training for audio-video-language models, especially long-context or real-time interactive systems.
Experience scaling mixed training/inference workloads across large GPU clusters.
Experience with adjacent areas such as distributed pretraining, data infrastructure, inference serving, simulation, human/AI feedback collection, or evaluation infrastructure.
Publications or substantial open-source contributions in RL, post-training, alignment, evaluation, ML systems, or model behavior.

Responsibilities

Build Nuance’s RL/post-training stack from 0→1: rollout generation, policy optimization, reward/reference model serving, data feedback loops, evaluation, checkpointing, observability, and debugging.
Develop and scale post-training methods such as PPO, GRPO, DPO, rejection sampling, RLHF/RLAIF, online RL, and model-based data improvement.
Design the systems abstractions that connect research ideas to production-scale RL runs: trainers, rollout workers, reward models, evaluators, data queues, experience buffers, and checkpoint promotion.
Build evaluation and feedback loops for omni behavior: turn-taking, interruption, timing, emotional response, audiovisual coherence, instruction following, and real-time interaction quality.
Optimize the end-to-end post-training loop across rollout throughput, serving latency, GPU utilization, policy update efficiency, queueing, checkpoint overhead, and research iteration speed.
Evolve the platform as algorithms, model architectures, reward definitions, data sources, and evaluation methods change.