Senior Scientist, Synthetic Data Generation

NVIDIA, Santa Clara, CA
$168,000 - $264,500

About The Position

NVIDIA is at the forefront of the AI revolution, and our research is shaping the future of large language models. We are looking for a Senior Scientist to join our team and help advance our capabilities in synthetic data generation for training frontier models. You will contribute to open-source libraries within the NVIDIA NeMo ecosystem that generate synthetic datasets across text, code, structured, and multimodal data, directly feeding the pre- and post-training of LLMs such as Nemotron. This role combines hands-on software engineering with applied research in generative methods, and you will collaborate with research, engineering, product, and model teams as well as external labs.

Requirements

  • PhD in Computer Science, Machine Learning, Statistics, or a related field, or equivalent experience.
  • A research background of 3+ years in synthetic data generation, generative modeling, multimodal machine learning, or related areas. Comparable experience is also considered.
  • Deep technical understanding of LLMs, how data shapes their pre- and post-training, and inference frameworks such as vLLM or TGI.
  • Proven track record of developing or maintaining software libraries used by a broad developer community.
  • Strong publication record at premier venues such as NeurIPS, ICML, ICLR, ACL, or similar.

Nice To Haves

  • Open-source contributions in ML or data tooling.
  • Experience with multimodal generation or understanding (vision-language, document AI, video, or audio).
  • Building and optimizing scalable data pipelines for large-scale model training (throughput, distributed inference).
  • Experience generating data for agentic, tool-use, or reinforcement-learning post-training.

Responsibilities

  • Build synthetic data generation pipelines using LLM-based methods and automated quality evaluation, producing datasets that improve the pre- and post-training of LLMs such as Nemotron in reasoning, coding, structured output, and multimodal understanding.
  • Advance multimodal synthetic data generation (image, document, video, and audio) in partnership with NVIDIA's model teams.
  • Design and maintain open-source libraries and SDKs with clean APIs and strong documentation.
  • Drive software excellence with modern tooling, configuration-driven architecture, and professional Git and CI/CD practices.
  • Publish original research at top machine learning and AI conferences to maintain NVIDIA's technical leadership.
  • Mentor interns and junior researchers to foster technical growth within the team.

Benefits

  • Equity
  • Benefits