About The Position

NVIDIA is at the forefront of the AI revolution, and our datacenter products are the engines powering this transformation. We seek a Senior Test Development Engineer to join our Silicon Solutions Architecture Development team. In this critical role, you will design next-generation testing methodologies that ensure the performance, reliability, and integrity of pioneering GPU server systems used in the world's most demanding computing environments. If you thrive on solving sophisticated technical challenges, shaping future hardware, and ensuring flawless product quality, this is your opportunity to make a direct impact on the future of AI and high-performance computing.

Requirements

  • BS/MS in Electrical Engineering, Computer Engineering, Computer Science, or related field (or equivalent experience).
  • 8+ years of experience in hardware validation, test development, or datacenter hardware engineering.
  • Expert programming skills in Python and/or C/C++ for automation and tool development.
  • Deep Linux/Unix expertise, including advanced shell scripting.
  • Strong knowledge of server architecture: CPUs, GPUs, PCIe, networking, and storage.
  • A hard-working, proactive approach with a proven ability to own and deliver complex projects.

Nice To Haves

  • Hands-on experience with NVIDIA GPU architecture (Hopper, Ampere) and software stack (CUDA, NCCL).
  • Experience testing high-speed interconnects such as NVLink or InfiniBand.
  • Familiarity with AI/ML or HPC benchmarking and stress-testing tools.
  • Proven track record of identifying and resolving critical bugs in pre-production hardware.

Responsibilities

  • Innovate & Build - Design and implement novel test plans, tools, and automation frameworks to validate GPU functionality, performance, and reliability in complex datacenter environments.
  • Safeguard Data Integrity - Develop groundbreaking stress tests and methodologies to detect, characterize, and eliminate silent data errors.
  • Build the Future of Hardware - Partner with architecture and silicon design teams to influence system and chip-level features that improve diagnostics, debuggability, and root-cause analysis.
  • Deep Dive Debugging - Analyze test results, investigate complex failures, and drive solutions in close collaboration with design, firmware, and software teams.
  • Lead & Mentor - Provide technical leadership, guide junior engineers, and shape validation strategy across datacenter product lines.

Benefits

  • Competitive compensation
  • Industry-leading benefits
  • Equity