About The Position

NVIDIA is searching for a senior or principal engineer who specializes in building cutting-edge infrastructure for large-scale foundation model training in the Generalist Embodied Agent Research (GEAR) group. Our team leads Project GR00T, NVIDIA’s moonshot initiative to build foundation models and full-stack technology for humanoid robots. You will work with an amazing and collaborative research team that consistently produces influential work on multimodal foundation models, large-scale robot learning, embodied AI, and physics simulation. Our past projects include Eureka, VIMA, Voyager, MineDojo, MimicPlay, Prismer, and more. Your contributions will have a significant impact on our research projects and product roadmaps.

Requirements

  • Bachelor's degree in Computer Science, Robotics, Engineering, or a related field
  • 10+ years of full-time industry experience in large-scale MLOps and AI infrastructure
  • Proven experience designing and optimizing distributed training systems with frameworks like PyTorch, JAX, or TensorFlow
  • Deep understanding of GPU acceleration, CUDA programming, and cluster management tools like Kubernetes
  • Strong programming skills in Python and a high-performance language such as C++ for efficient system development
  • Strong experience with large-scale GPU clusters, HPC environments, and job scheduling/orchestration tools (e.g., SLURM, Kubernetes)

Nice To Haves

  • Master’s or PhD degree in Computer Science, Robotics, Engineering, or a related field
  • Demonstrated Tech Lead experience, coordinating a team of engineers and driving projects from conception to deployment
  • Strong experience building large-scale LLM and multimodal LLM training infrastructure
  • Contributions to popular open-source AI frameworks, or research publications in top-tier AI conferences such as NeurIPS, ICRA, ICLR, or CoRL

Responsibilities

  • Design and maintain large-scale distributed training systems to support multi-modal foundation models for robotics.
  • Optimize GPU and cluster utilization for efficient model training and fine-tuning on massive datasets.
  • Implement scalable data loaders and preprocessors tailored for multimodal datasets, including videos, text, and sensor data.
  • Develop robust monitoring and debugging tools to ensure the reliability and performance of training workflows on large GPU clusters.
  • Collaborate with researchers to integrate cutting-edge model architectures into scalable training pipelines.

Benefits

  • Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.
  • You will also be eligible for equity and benefits.
  • NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer.