Senior AI-HPC Cluster Engineer

$148,000 - $339,250/Yr

Nvidia - Santa Clara, CA

posted 23 days ago

Full-time - Senior
Santa Clara, CA
Computer and Electronic Product Manufacturing

About the position

As a Senior AI-HPC Cluster Engineer at NVIDIA, you will lead the design and implementation of advanced GPU compute clusters that support demanding deep learning and high-performance computing workloads. This role involves addressing strategic challenges related to compute, networking, and storage design, while also focusing on effective resource utilization and evolving cloud strategies within a global computing environment.

Responsibilities

  • Building and improving the ecosystem around GPU-accelerated computing, including developing large-scale automation solutions.
  • Maintaining and building deep learning clusters at scale.
  • Supporting researchers in running their workflows on clusters, including performance analysis and optimizations of deep learning workflows.
  • Conducting root cause analysis and suggesting corrective actions for problems of various scales.
  • Proactively identifying and fixing potential problems before they occur.

Requirements

  • Bachelor's degree in Computer Science, Electrical Engineering, or related field, or equivalent experience.
  • Minimum 5 years of experience designing and operating large-scale compute infrastructure.
  • Experience analyzing and tuning performance for a variety of AI/HPC workloads.
  • Working knowledge of cluster configuration management tools such as Ansible, Puppet, or Salt.
  • Experience with advanced AI/HPC job schedulers, ideally Slurm, K8s, RTDA, or LSF.
  • In-depth understanding of container technologies like Docker, Singularity, Shifter, or Charliecloud.
  • Proficiency with CentOS/RHEL and/or Ubuntu Linux distributions, as well as Python programming and Bash scripting.
  • Experience with AI/HPC workflows that use MPI.

Nice-to-haves

  • Experience with NVIDIA GPUs, CUDA Programming, NCCL, and MLPerf benchmarking.
  • Experience with Machine Learning and Deep Learning concepts, algorithms, and models.
  • Familiarity with InfiniBand with IBOP and RDMA.
  • Understanding of fast, distributed storage systems like Lustre and GPFS for AI/HPC workloads.
  • Familiarity with deep learning frameworks like PyTorch and TensorFlow.

Benefits

  • Highly competitive salaries
  • Comprehensive benefits package
  • Equity options
  • Ongoing application acceptance