About The Position

For more than 25 years, NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing. Its unique legacy of innovation is motivated by great technology and amazing people. NVIDIA's invention of the GPU redefined modern computing. More recently, GPU deep learning ignited modern AI, the next era of computing, with the GPU acting as the brain of computers, robots, and self-driving cars that can perceive and understand the world. Today, we are increasingly known as “the AI computing company”. We are looking to continue to grow with the smartest people in the world. We are looking for you.

We are seeking a Principal System Software Engineer, Cloud Networking and Performance, to develop massively scalable and performant Software-Defined Networking for the NVIDIA AI Factory software stack.

Requirements

  • BA/BS degree in Computer Science or a related technical field (or equivalent experience).
  • 15+ years of validated experience in networking disciplines.
  • Strong background in Kubernetes (K8s) and networking performance.
  • Deep understanding of various networking protocols with hands-on development experience.
  • Experience in crafting network architecture for cloud/distributed systems.
  • Hands-on experience with large-scale network setups and device (switches/routers, etc.) configurations, along with network management systems, network monitoring systems, or network operations.
  • Strong understanding of OVS, OVN, and the OVN-K8s CNCF upstream project.
  • Hands-on experience with software for SR-IOV.

Responsibilities

  • Design and optimize a next-generation, horizontally scalable, highly performant multi-tenant network architecture, with security in mind, to support data center and AI Factory infrastructure at massive scale.
  • Implement this SDN software as part of the larger NVIDIA software stack for AI Factory, working closely with NVIDIA compute, storage, and other teams.
  • Design and develop OSS software for monitoring overall network health, performance, and scalability. The system needs built-in intelligence to track and regulate usage by individual tenants, analyze it, and take appropriate action.