About The Position

Want to join a team that's revolutionizing the field of AI with data center scale solutions? We are seeking a highly technical Senior AI compute Engineer to serve as a forward-deployed technical liaison between NVIDIA’s Partner Network (NPN), NVIDIA’s internal teams, and enterprise customers. Academic, Enterprise and government groups around the world are using NVIDIA products to revolutionize deep learning and data analytics, and to power data centers. Join the team building many of the largest and fastest AI Compute systems in the world! NVIDIA is looking for someone with the ability to work on a dynamic customer-focused team that requires excellent communication skills. This role will be interacting with customers, partners and internal teams to analyze, define, and implement large-scale AI Compute projects. These efforts include a combination of networking, system design and automation and validation This role embodies the forward-deployed engineer approach: you will spend significant time embedded with customer and partner technical teams. You will move fluidly between field deployments and internal NVIDIA collaboration. Your success is measured by the successful deployment and optimization of AI factories that enable customers and partners to achieve their AI ambitions. You will also provide critical feedback to refine NVIDIA’s reference architectures and partner enablement programs!

Requirements

  • 8+ years providing in-depth support and deployment services; solving problems for hardware and software products.
  • Knowledge and experience with Linux system administration, process and package management, task scheduling, kernel management, boot procedures/fixing, performance reporting/optimization/logging, network-routing/advanced networking (tuning and monitoring).
  • Cluster management and provisioning technologies for bare-metal servers (bonus credit for BCM (Base Command Manager)).
  • Minimum of a four-year degree from an accredited university or college in Computer Science, Electrical or Computer Engineering or equivalent experience.
  • Scripting proficiency (Bash, Python, Ansible, etc.).
  • Experience with schedulers such as SLURM, LSF, UGE, etc.
  • Excellent interpersonal skills and the ability to deliver resolutions for customer issues as they arise.
  • An ability to travel to customer sites within the United States up to 30% of the time.
  • Experience with benchmarking tools such as HPL, NCCL tests, MLPerf as well as Kubernetes experience.

Nice To Haves

  • InfiniBand experience.
  • Experience with GPU (Graphics Processing Unit) focused hardware/software.
  • Experience with MPI (Message Passing Interface).
  • Storage technologies such as Lustre or GPFS.
  • Familiarity with OEM GPU platforms
  • Strong channel sales and services knowledge and partner co-selling experience

Responsibilities

  • Deploying, managing, and validating AI Compute/HPC infrastructure in Linux-based environments for new and existing customers.
  • Be the domain expert with partners during planning calls through implementation.
  • Handover-related documentation and perform knowledge transfers required to support customers and partners as they begin rolling out some of the most sophisticated systems in the world!
  • Provide feedback to internal and partners teams such as opening bugs, documenting workarounds, and suggesting improvements.

Benefits

  • You will also be eligible for equity and benefits.
© 2024 Teal Labs, Inc
Privacy PolicyTerms of Service