About The Position

NVIDIA Networking has been a leader in high-performance networking infrastructure for many years. The next unit of computing is the datacenter, and the network makes it all possible! We are growing our networking architecture team with people passionate about accelerated computing.

We are looking for you, a Networking Software Architect, to develop the next generation of networking protocols for AI. Within NVIDIA's Networking software architecture team, we develop RDMA transport protocols and build the infrastructure that underlies protocols such as RoCEv2 and NVIDIA Spectrum-X and is important for scaling AI. We're seeking a highly motivated, creative professional with network simulation expertise and experience in RDMA protocols to join the team.

Efficient and fast communication between GPUs directly impacts end-to-end AI application performance, and this impact continues to grow with the increasing scale of next-generation systems. This is an outstanding opportunity to advance the state of the art, break performance barriers, and deliver platforms the world has never seen before. Are you ready to build the new and innovative technologies that will help realize NVIDIA's vision?

Requirements

  • M.S./Ph.D. degree in CS/CE or equivalent experience.
  • 5+ years of relevant experience.
  • Excellent C/C++ programming and debugging skills.
  • Experience with network simulations.
  • Deep understanding of RDMA.
  • Solid fundamentals in compute, network architecture, and operating systems.
  • Strong experience with Linux.
  • Ability and flexibility to work and communicate effectively in a multi-national, multi-time-zone corporate environment.

Nice To Haves

  • Expertise in related technology and passion for what you do.
  • Experience with NCCL collectives, AI communication patterns, and parallelization techniques.
  • Strong collaborative and interpersonal skills and a proven track record of effectively guiding and influencing within a dynamic and multi-functional environment.

Responsibilities

  • Perform network simulations of communication patterns prevalent in AI applications, such as those that use NCCL.
  • Design and implement new techniques and protocols to improve communication performance.
  • Explore innovative HW and SW solutions for our next-generation platforms as part of the programmable RoCE architecture.
  • Build proofs-of-concept, conduct experiments, and perform quantitative modeling to evaluate and drive new innovations.
  • Use simulation to explore the performance of AI applications on large GPU clusters.