Senior Research Manager, World Model Evaluation

NVIDIA • Santa Clara, CA

1d • $272,000 - $431,250 • Onsite

About The Position

At NVIDIA, we’re not just building the future, we’re generating it! Our world model team is pushing the boundaries of multimodal AI, robotics, and world foundation models for Physical AI. We are looking for a Senior Research Manager to lead world-model evaluation and benchmarking across NVIDIA’s Physical AI model portfolio. This role will build the team and research agenda for evaluating world models through closed-system evaluations, where the model under test is pluggable, and open-system evaluations, where access to model internals enables deeper diagnostics, causal analysis, and mechanistic evaluation. This is not only about leaderboards. It is about defining what makes a world model useful for Physical AI, discovering model failures, and turning those findings into better data, training recipes, model roadmaps, and downstream systems. The team will build a closed improvement loop across model evaluation, failure discovery, data generation, post-training, and re-evaluation.

Requirements

Strong research background in machine learning, computer vision, multimodal AI, robotics, world models, representation learning, model evaluation, or mechanistic interpretability.
Experience leading research teams, research programs, or cross-functional technical initiatives with measurable scientific and product impact.
Deep understanding of modern foundation models, including video models, vision-language-action models, diffusion or flow models, self-supervised learning, or world-model architectures.
Experience designing rigorous benchmarks, evaluation datasets, metrics, diagnostic tools, or model analysis frameworks for complex ML systems.
Familiarity with world-model evaluation and open-system analysis techniques, such as physical plausibility, temporal consistency, action conditioning, counterfactual reasoning, representation probing, activation patching, causal interventions, sparse autoencoders, or feature attribution.
PhD or equivalent experience in Computer Science, Electrical Engineering, Robotics, Machine Learning, AI, or a related field, with 12+ years of relevant research or engineering experience overall and 5+ years of management experience.
Ability to work onsite at NVIDIA’s Santa Clara headquarters; this is not a remote position.

Nice To Haves

Built influential benchmarks, evaluation suites, model diagnostics, or interpretability tools used by research or production teams.
Published in areas such as world models, video generation, physical AI, embodied AI, robotics, representation learning, mechanistic interpretability, self-supervised learning, or model evaluation.
Experience evaluating generative video models, action-conditioned world models, robotics foundation models, world-action models, synthetic data generation systems, simulation systems, or vision-language-action models.
Strong point of view on what current benchmarks miss, and excitement to build the next generation of evaluation science for Physical AI.

Responsibilities

Lead a team of Research Scientists focused on world-model evaluation, benchmarking, and diagnostics for NVIDIA Physical AI models, including world foundation models, world-action models, synthetic data generation systems, robotics, simulation, and embodied AI workflows.
Define the scientific roadmap for closed-system and open-system evaluation, including open-loop and closed-loop benchmarks, metrics, failure taxonomy, model comparison, and evaluation-to-training feedback loops.
Develop benchmarks for physical plausibility, temporal consistency, scene dynamics, object permanence, spatial reasoning, action conditioning, affordances, controllability, long-horizon coherence, synthetic data generation (SDG) quality, and world-action model (WAM) usefulness.
Develop open-system and mechanistic evaluation methods using model internals, including representation probing, causal interventions, activation analysis, ablations, sparse autoencoders, attention and feature analysis, and circuit-style diagnostics.
Drive evaluation-to-model-improvement loops with training, post-training, data curation, simulation, robotics, SDG, WAM, and applied research teams, including failure discovery, data generation, post-training priorities, model roadmap feedback, and re-evaluation.
Publish high-quality papers, technical reports, benchmarks, and open-source evaluation artifacts while establishing rigorous standards for validity, reproducibility, dataset hygiene, leakage prevention, and model comparison.