About The Position

NVIDIA is seeking a motivated Deep Learning engineer to integrate advanced CUDA features and distributed runtime technologies into AI stacks such as PyTorch, TRT-LLM, vLLM, SGLang, and JAX. You will join the team responsible for the core CUDA features and runtimes that scale Deep Learning and HPC applications. The role spans diverse multi-GPU demands, from training at scales of up to 100K GPUs to inference at microsecond latency. Your work will improve both the productivity and performance of AI applications, accelerating their adoption by the community. This is a significant opportunity for engineers with an AI background to contribute to state-of-the-art advancements.

Requirements

  • BS, MS, or PhD degree in Computer Science, Computer Engineering, Electrical Engineering, or related field (or equivalent experience).
  • 8+ years of relevant industry experience, or equivalent academic experience after completing your degree.
  • Development experience with Deep Learning frameworks such as PyTorch and JAX, and with inference engines such as TRT-LLM, vLLM, or SGLang.
  • Rapid prototyping and development in Python, C++, CUDA, or related DSLs.
  • Solid grasp of AI models, parallelism strategies, and/or compiler technologies (e.g., torch.compile).
  • Experience conducting performance benchmarking on AI clusters.
  • Familiarity with at least one performance profiling toolchain (e.g., PyTorch Profiler, NVIDIA Nsight Systems).
  • Understanding of HPC/AI communication concepts.
  • Good understanding of computer system architecture, hardware-software interactions, and operating systems principles (i.e., systems software fundamentals).
  • Adaptability and passion to learn new frameworks and tools.
  • Flexibility to work and communicate effectively across different teams and time zones.

Nice To Haves

  • Deep expertise in the performance internals and execution graphs of major deep learning autograd, training, and inference frameworks (e.g., PyTorch, JAX, TensorRT, vLLM, SGLang, NeMo, Megatron, MaxText).
  • Hands-on experience with CUDA, communication libraries (e.g., NCCL, MPI, UCX), and distributed machine learning techniques (e.g., pipeline parallelism, tensor parallelism).
  • Expertise in one or more of these areas: training, distributed inference, MoE, reinforcement learning, or kernel authoring (in CUDA, Triton, CuTe, etc.).
  • Background in deep learning compilers, both graph-level and codegen (e.g., Triton, XLA, torch.compile).
  • Experience programming for compute-communication overlap in distributed runtimes.

Responsibilities

  • Integrate new CUDA features and runtime abstractions into AI frameworks, from proof of concept through performance analysis to production.
  • Perform deep analysis of AI workloads and frameworks to identify requirements and opportunities for innovation in the lower layers of the stack.
  • Collaborate hands-on with teams working on the latest AI models.
  • Own and drive improvements in the AI Compiler-Runtime interface to build high-performance multi-GPU, multi-node solutions.
  • Design fault-tolerant and elastic solutions for large-scale or dynamic AI workloads.
  • Influence the roadmap of core CUDA to facilitate the development of next-generation DL frameworks.
  • Collaborate with a dynamic team across multiple time zones.
  • Collaborate closely with AI researchers, HW and SW architects, kernel and compiler authors, and CUDA driver experts to co-design systems and frameworks that enhance performance and programmability.
  • Develop exploratory tools and runtime systems to profile and accelerate new paradigms in deep learning.
  • Write clean, effective, and maintainable code, ensuring exploratory prototypes can smoothly transition into open-source releases, upstream framework integrations, internal tools, or closed-source commercial products.

Benefits

  • Equity
  • Benefits