About The Position

We are seeking a Lead Systems Quality and Reliability Engineer to join our LPU team! NVIDIA has continuously reinvented itself over two decades. Our invention of the GPU in 1999 fueled the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI, the next era of computing. NVIDIA is a “learning machine” that constantly evolves by adapting to new opportunities that are hard to solve, that only we can tackle, and that matter to the world. This is our life’s work, to amplify human imagination and intelligence!

What you'll be doing: You will own, build, and manage the RMA and FA debug and root-cause analysis for existing and new NVIDIA AI/ML products. You will conduct tests and root-cause analysis.

Requirements

  • BS/MS in EE, Physics or a related degree (or equivalent experience)
  • 8+ years of hands-on systems test and/or validation engineering experience
  • Proven hands-on management and leadership experience
  • Competence using lab equipment such as oscilloscopes, logic analyzers, and power analyzers
  • Experience enabling reliability tests such as HTOL and quality tests such as burn-in
  • Strong knowledge of fault isolation techniques such as OBIRCH, DLS/LADA, LVP, and LVI
  • Proficiency with high-speed interfaces (SerDes, PCIe, DDR)
  • Proficiency in Python, Perl, C++, or other languages on UNIX/Linux
  • Excellent knowledge of PCB card and system-level test and debug, and the ability to manage factory floor partners (CMs) for RMA/FA activities

Nice To Haves

  • Working knowledge of FA techniques and tools such as FIB, SEM, TDR, VNA, and CSAM

Responsibilities

  • Conduct and lead debug and root-cause analysis of field RMAs
  • Collaborate with systems, hardware, software, and operations engineers as required
  • Scale root-cause FA capabilities within your organization
  • Create FA result reports that align with the standard 8D process or a similar one
  • Analyze RMA, FA and repair data. Identify trends and raise quality alerts when necessary.
  • Drive resolution, containment, and mitigation plans for these quality alerts
  • Oversee hardware quality performance, monitoring field quality data and associated metrics including RMA rates, MTBF, and Reliability Ratio
  • Manage operational performance of FA at CMs, ensuring partners achieve key performance indicators including FA cycle times, fault duplication rates, and fault isolation rates
  • Oversee the setup of new products into Failure Analysis operations