Senior Deep Learning Scientist, Multimodal Conversational AI

NVIDIA • Santa Clara, CA

About The Position

NVIDIA is widely regarded as one of the most desirable employers in technology. It leads in High-Performance Computing, Artificial Intelligence, and Visualization. Our invention, the GPU, acts as the visual cortex of modern computers and powers our products. GPU deep learning sparked modern AI, the next computing era. The GPU serves as the brain for computers, robots, autonomous cars, and conversational AI that understand the world. Today, we are known as “the AI computing company.” We want to grow and hire the smartest people. Join us at the forefront of technology.

NVIDIA is hiring Senior Deep Learning Scientists interested in streaming multimodal conversational AI, including speech, audio, vision, voice chat, and action, as well as human-AI interaction. You will demonstrate foundational expertise in deep learning, reinforcement learning, computational statistics, and applied mathematics. You will have a chance to define core algorithmic improvements and scale your ideas through our Nemotron platform. You will work on high-impact, high-visibility large language model products that improve the experience for millions of users.

If you are creative and passionate about real-world conversational AI issues, come join our Nemotron LLM team. For more details on Nemotron LLM, check https://www.nvidia.com/en-us/ai-data-science/foundation-models/nemotron/

Requirements

Master’s degree (or equivalent experience) or PhD in Computer Science, Electrical Engineering, Artificial Intelligence, or Applied Math with 8+ years of experience
Excellent programming skills in Python with strong fundamentals in programming, optimizations, and software development
Strong knowledge of ML/DL techniques, algorithms, and tools, with exposure to CNNs, RNNs (LSTM), and Transformers (ViT, BERT, BART, GPT/T5, Megatron, LLMs, MoEs)
Experience training real-time audio language models, streaming visual language models, and streaming real-time audio-visual language models, as well as ViT, BERT, GPT, and Nemotron models, for computer vision, NLP, and dialog system tasks using the PyTorch deep learning framework, including data wrangling, tokenization, and multimodal alignment
Practical experience in natural language processing, speech/audio processing, computer vision, machine learning, and human-AI interaction
Hands-on experience with conversational AI technologies such as natural language understanding, natural language generation, dialog systems (including system integration, state tracking, and action prediction), information retrieval, question answering, and machine translation
Understanding of the model development life cycle and experience with model development workflows, traceability, and dataset versioning, including know-how of database management and queries (in SQL, MongoDB, etc.)
Strong collaborative and interpersonal skills, specifically a proven ability to effectively guide and influence within a dynamic matrix environment

Nice To Haves

Native or near-native fluency in at least one of these languages: Spanish, Mandarin, German, Japanese, Russian, French, UK English, Arabic, Korean, Italian, or Portuguese
Verified background in building LLMs that incorporate knowledge discovery along with reasoning abilities, including disambiguation, clarification, anticipation, and effective error handling for embodied AI systems
Validated experience adapting LLMs to different domains such as gaming, virtual assistants, video conferencing, and so on
Experience integrating embodied AI systems with various sensor inputs (camera, microphone, torch, and so on) and backend action fulfillment systems
Experience with long-term reasoning for embodied AI tasks (navigation, mobile manipulation, instruction following, and collaboration with humans) in gaming/physical environments, given natural-language instructions.

Responsibilities

Develop, train, fine-tune, and deploy streaming large language models to power multimodal conversational AI systems encompassing multimodal understanding, speech synthesis, speech-to-speech conversation, video generation, UI and animation rendering and control, environment interaction, and dialog reasoning and tool systems
Apply the latest fundamental and applied research to develop products for multimodal conversational artificial intelligence
Apply techniques such as instruction tuning, reinforcement learning from human feedback (RLHF), reinforcement learning with verifiable rewards (RLVR), and parameter-efficient fine-tuning methods such as p-tuning, adapters, and LoRA to improve embodied conversational LLMs for multiple use cases
Lead the collection, development, and labeling of domain-specific datasets to train LLMs for various multimodal tasks and applications
Measure and benchmark model and application performance. Analyze model accuracy and bias, and recommend next steps and improvements.
Collaborate with various teams on new product features and improvements of existing products
Participate in developing and reviewing code, building documents, and conducting use case reviews and test plan reviews.
Help innovate, identify problems, recommend solutions, and perform triage in a collaborative team environment

Benefits

With competitive salaries and a generous benefits package, NVIDIA is considered one of the technology world’s most desirable employers.
You will also be eligible for equity and benefits.
