Senior HPC and AI Networking Performance Research and Analysis Engineer

NVIDIA • Posted about 1 year ago

$148,000 - $276,000/Yr

Full-time • Senior

Santa Clara, CA

Computer and Electronic Product Manufacturing

NVIDIA is seeking a Senior High Performance Computing (HPC) and AI Networking Performance Research and Analysis Engineer to join our Performance group. This role involves profiling and analyzing AI workloads on large-scale GPU and CPU clusters for distributed deep learning (DL) LLM training, with a focus on collective communication and networking. The engineer will develop performance analysis tools and methodologies to understand performance expectations, limitations, and bottlenecks in high-performance networking environments.

Responsibilities

Exploring and researching AI workloads and DL models for large-scale deep learning LLM training on NVIDIA supercomputers and distributed systems.
Benchmarking, profiling, and analyzing performance to identify bottlenecks and areas for improvement, with a focus on networking aspects.
Implementing performance analysis tools.
Collaborating with teams from hardware to software to provide performance analysis insights.
Defining performance test planning and setting performance expectations for new technologies and solutions.

Requirements

B.Sc. in Computer Science or Software Engineering, or equivalent experience.
5+ years of experience with high-performance networking (RDMA, MPI, NCCL, congestion control algorithms).
Demonstrated performance analysis skills and methodologies.
Experience with NVIDIA GPUs, the CUDA library, and deep learning frameworks such as TensorFlow or PyTorch.
Expertise in networking collective communication libraries (such as NCCL) and protocols (such as RoCE and RDMA).
Strong analytical and problem-solving skills with fast self-learning capabilities.
Proficiency in programming languages: Python, Bash, and C.
Experience with Linux distributions.
Good communication and interpersonal skills.

Preferred qualifications

In-depth knowledge and experience with AI workloads and benchmarking for distributed LLM training.
Knowledge of the CUDA and NCCL libraries.
Knowledge of congestion control algorithms.
In-depth system knowledge (Intel / AMD / ARM CPUs, NVIDIA GPUs, HCA, Memory, PCI).
Strong performance analysis skills using modern tools.

Benefits

Highly competitive salaries
Comprehensive benefits package
Equity options
Diverse and supportive work environment
