Senior Systems Performance Engineer

NVIDIA
US, CA
Onsite

About The Position

NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. It’s a unique legacy of innovation that’s fueled by great technology and amazing people. Today, we’re tapping into the unlimited potential of AI to define the next era of computing, an era in which our GPU acts as the brains of computers, robots, and self-driving cars that can understand the world. Doing what’s never been done before takes vision, innovation, and the world’s best talent. As an NVIDIAN, you’ll be immersed in a diverse, supportive environment where everyone is inspired to do their best work. Come join the team and see how you can make a lasting impact on the world.

NVIDIA has continuously reinvented itself over two decades. Our invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI, the next era of computing. NVIDIA is a “learning machine” that constantly evolves by adapting to new opportunities that are hard to solve, that only we can tackle, and that matter to the world. This is our life’s work, to amplify human imagination and intelligence. Make the choice to join us today.

We are now looking for a Senior Validation Engineer in the DGX Server Product Engineering Team. In this role you will work with a team of HW/SW engineers to develop and implement complex automated test plans for our industry-leading GPU-accelerated computing products.

Requirements

  • Ability to work on site in a hardware lab environment 5 days a week
  • BSEE or BSCE or equivalent experience
  • 5+ years of experience validating and debugging complex systems.
  • Experience developing and running real-world ML/LLM workloads.
  • Dynamo, TensorRT, Slurm, and BCM skills are required.
  • Knowledge of vLLM and SGLang preferred.
  • Proficiency in CUDA, cuBLAS, and CUTLASS.
  • Deep understanding of computing architectures.
  • Coding experience in Python and experience running simulators.
  • Experience with datacenter products including system management, security, networking, and storage.

Nice To Haves

  • Background with x86/Arm server architectures and accelerated GPU computing.
  • Track record of continuous process improvement with a passion for tools and automation.

Responsibilities

  • System architecture, design, performance modeling, and estimation across new models and new packages.
  • Enable GPU SKU bring-up, validation, and model enablement.
  • Develop system-level stress and performance testing strategies using industry-leading Deep Learning/AI applications.

Benefits

  • NVIDIA offers highly competitive salaries and a comprehensive benefits package.
  • Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.
  • You will also be eligible for equity and benefits.