About The Position

NVIDIA is building the world’s most groundbreaking and innovative accelerated computing platforms for AI and HPC. Because of our work, scientists, researchers, and engineers can push the boundaries of what’s possible. We pioneered a supercharged form of computing that powers everything from breakthrough AI research to the world’s fastest supercomputers.

We are seeking a highly motivated Senior Solutions Architect to join the Cluster Design and Architecture team with a focus on GPU, NVLink, and infrastructure design. In this role, you will be at the forefront of assisting with the design and architecture of some of the largest next-generation GPU-based clusters, enabling the world’s most advanced AI supercomputers and enterprise AI infrastructure in the field. As a Solutions Architect, you will serve as a key technical expert on NVIDIA’s groundbreaking GPU and NVLink technology designs and on our software solutions, bridging engineering and the field teams that support customers with the most demanding requirements. You will work on end-to-end cluster design and architecture, performance modeling, validation, and new product introduction (NPI) cluster deployments. Your expertise will directly influence how the world’s leading AI companies, cloud providers, hyperscalers, research institutions, and enterprises build their infrastructure.

Requirements

  • BS, MS, or PhD in Computer Science, Electrical Engineering, Computer Engineering, Physics, or related field (or equivalent experience)
  • 8+ years of experience in cluster design, validation, and issue resolution, specifically on GPU and HPC clusters
  • Proven expertise in designing large-scale distributed systems, AI clusters, or HPC infrastructure
  • Ability to translate sophisticated engineering concepts into customer-ready documentation, diagrams, and reference material
  • Expertise in driving customer and partner issues to closure with product and engineering teams
  • Ability to handle multi-functional communications across customers and product, support, engineering, and other teams

Nice To Haves

  • Experience leading large-scale AI Factory or HPC cluster bring-ups or builds
  • Hands-on experience with NVIDIA products including, but not limited to, GPUs, NVLink, and NVIDIA Networking; specifically, debugging issues that occur during deployment on NVLink and similar products
  • Knowledge of NCCL, MPI, IMEX, NMX, and collectives in distributed training as they pertain to cluster design
  • External customer-facing skill set and background
  • Effective time management and capability to balance multiple tasks and customers while thinking creatively to debug and solve problems

Responsibilities

  • Partner with internal engineering efforts in GPU cluster design and networking, and convey architecture and optimal process information both directly to customers and to the field teams supporting them
  • Guide field teams and their customers in cluster design, weighing design principles against complex, situational limitations to build the most performant and supportable GPU clusters possible
  • Work closely with field teams supporting customers to ensure successful first deployments with new products, including new network architectures and topologies
  • Feed customer and field perspectives on cluster design and workflows back to the engineering teams that design internal clusters and/or create customer-facing documentation on standard processes and service flows
  • Perform hands-on work to assist field teams in debugging issues relating to cluster design, configuration, and performance, drawing on internal engineering expertise and knowledge of known bugs
  • Support NPI customer deployments with new GPU/Networking architectures

What This Job Offers

  • Job Type: Full-time
  • Career Level: Mid Level
  • Education Level: Ph.D. or professional degree
  • Number of Employees: 5,001-10,000
