Member of Technical Staff — RL Research (New PhD Grad)

Nuance Labs•Seattle, WA

$250,000 - $350,000 • Onsite

About The Position

Nuance Labs is building photorealistic, real-time AI avatars with emotional intelligence: a full-duplex audiovisual system that can listen, speak, react, interrupt, and respond like a real person. This posting is aimed at researchers who are completing, or have recently completed, a PhD and want to do their best work at a fast-moving frontier lab.

This role is broader than a traditional RL algorithm role. You’ll be expected to understand modern post-training methods and help build the infrastructure needed to run them at scale. The work spans RL method development, rollout generation, reward modeling, policy optimization, evaluation, data feedback loops, serving, observability, and distributed execution.

You’ll help build Nuance’s RL/post-training stack from 0→1 and scale it from 1→10. That means turning rapidly evolving research ideas into reliable training systems: defining the abstractions, choosing or modifying frameworks, wiring together rollout workers and trainers, building reward/evaluation loops, debugging failure modes, and making the system fast enough for researchers to iterate.

For Nuance, post-training is not limited to text. Our models are omni from the ground up: audio, video, language, and real-time full-duplex interaction. We need RL and post-training methods that improve interactive behavior, timing, interruption, emotional response, audiovisual coherence, and real-time conversational quality.

This is a high-ownership role with direct impact on how Nuance models improve after pretraining, and a place to grow fast alongside people who’ve built these systems before.

Requirements

A PhD, completed or in its final stretch, in ML, RL, or a related field, with research depth shown through publications, work in a strong lab or with a strong advisor, or substantial open-source work.
Solid understanding of RL/post-training methods: policy optimization, reward modeling, preference optimization, rejection sampling, KL control, evaluation, and data feedback loops.
Ability to reason about model behavior and training dynamics: reward hacking, unstable rewards, distribution shift, stale policies, mode collapse, over-optimization, noisy preferences, and evaluation mismatch.
Exposure to RL/post-training pipelines through research, internships, or open-source work, using frameworks such as verl, ms-swift, OpenRLHF, or equivalent, plus familiarity with rollout serving systems such as vLLM. You don’t need to have run these at production scale yet; you need to learn fast and go deep.
Strong software engineering fundamentals and the appetite to build real systems, not just prototypes.
Curiosity and adaptability toward new RL algorithms, model architectures, serving systems, evaluation methods, and research ideas.

Nice To Haves

Hands-on experience with omni or multimodal post-training for audio-video-language models, especially long-context or real-time interactive systems.
Experience with PPO, GRPO, DPO, online RL, RLHF/RLAIF, reward modeling, preference data, synthetic data generation, or model-based data improvement.
Prior 0→1 experience building post-training systems, RL pipelines, agent training systems, evaluation platforms, or model improvement loops.
Experience with adjacent areas such as distributed pretraining, data infrastructure, inference serving, simulation, human/AI feedback collection, or evaluation infrastructure.
Publications or substantial open-source contributions in RL, post-training, alignment, evaluation, ML systems, or model behavior.

Responsibilities

Build Nuance’s RL/post-training stack from 0→1: rollout generation, policy optimization, reward/reference model serving, data feedback loops, evaluation, checkpointing, observability, and debugging.
Develop and scale post-training methods such as PPO, GRPO, DPO, rejection sampling, RLHF/RLAIF, online RL, and model-based data improvement.
Design the systems abstractions that connect research ideas to production-scale RL runs: trainers, rollout workers, reward models, evaluators, data queues, experience buffers, and checkpoint promotion.
Build evaluation and feedback loops for omni behavior: turn-taking, interruption, timing, emotional response, audiovisual coherence, instruction following, and real-time interaction quality.
Optimize the end-to-end post-training loop across rollout throughput, serving latency, GPU utilization, policy update efficiency, queueing, checkpoint overhead, and research iteration speed.
Evolve the platform as algorithms, model architectures, reward definitions, data sources, and evaluation methods change.