About The Position

The connectivity engineer translates product reference architectures and logical network diagrams into physical builds, applying NVIDIA's AI Factory build guidelines to NVIDIA's large-scale internal research clusters. This role acts as the lead engineer for all in-cluster cabling, pathway, and rack layout optimizations required to power global-scale AI deployments, ensuring the cluster is co-designed with facilities infrastructure (power and cooling) and infrastructure software. This role provides an outstanding opportunity to be at the forefront of NVIDIA's technology roadmap!

Requirements

  • Minimum of 12 years in a connectivity, network architecture, or engineering role within a hyperscale cloud provider, large-scale enterprise data center, or High-Performance Computing (HPC) environment.
  • BA or BS (or equivalent experience).
  • Consistent record of designing, deploying, and operating network fabrics for thousands of GPU/CPU nodes.
  • Deep expertise in high-speed interconnect technologies, including InfiniBand, RoCE, and RDMA.
  • Proven experience designing connectivity solutions for high-density GPU clusters (100kW+ per rack) and understanding the unique front-end and back-end requirements for AI training vs. inference.
  • Deep understanding of data center infrastructure, including rack power/cooling, cable management, and physical density constraints.
  • Demonstrated ability to lead multidisciplinary teams and complete sophisticated technical initiatives.

Nice To Haves

  • Deep expertise with NVIDIA's compute and network product families and deployment standards.
  • Comfortable operating at the intersection of network engineering, MEP systems, and Infrastructure-as-a-Service software layers.
  • Experienced with field deployments and/or global reference design documentation, ideally both.

Responsibilities

  • Own the development of connectivity reference designs based on requirements from cluster architecture, network engineering, infrastructure software and product hardware teams.
  • Build comprehensive documentation, including detailed rack elevations, network architecture diagrams, and cabling point-to-point lists.
  • Support projects throughout design and deployment phases.
  • Serve as the primary engineering support, closely collaborating with deployment and field teams to ensure successful cluster build-out and operation.
  • Strategically co-design the cluster with power and cooling infrastructure teams, ensuring a thorough understanding of all facility requirements (architectural, power, and cooling).
  • Work with hardware, network, and security teams to translate software stack requirements into physical requirements: hardware selection, fault domains, and network architecture.
  • Develop new solutions and products in the connectivity space to accelerate the deployment of large-scale AI Factories.

Benefits

  • You will be eligible for equity and benefits.