About The Position

We're looking for a Principal Software Engineer to join our CSP Engagements team as the technical focal point for fleet-scale reliability, working directly with engineering teams of key CSP / hyperscale customers to ensure NVIDIA platforms achieve target MTBI (Mean Time Between Interruptions) in production. In this role, you will augment NVIDIA's internal software/firmware and quality teams with a dedicated CSP-facing focus. You will drive work streams with CSP engineering teams to build a shared understanding of reliability software/firmware architecture and methodology, incorporate their fleet telemetry and failure data into NVIDIA's improvement priorities, and validate that reliability improvements measured in the lab translate to real customer environments. Your cross-CSP visibility enables you to distinguish systemic architectural gaps from environmental or configuration-specific issues, a distinction that no single customer engagement could make alone.

Requirements

  • 15+ years of experience in systems software at datacenter scale, or in reliability engineering with a focus on at-scale challenges.
  • BS or MS in Computer Science, Electrical Engineering, Statistics, or a related field (or equivalent experience)
  • Deep expertise in multi-NUMA, rack-scale system software and firmware.
  • Statistical failure analysis methods: MTBF/MTBI calculation, Pareto analysis, root cause classification
  • Experience with fleet-level telemetry and observability systems: time-series databases, anomaly detection, health scoring, event correlation
  • Understanding of hardware failure modes in large-scale GPU/accelerator deployments — ability to classify and prioritize across compute, interconnect, memory, power, and thermal domains
  • Experience defining or operating burn-in, stress testing, or certification frameworks for complex hardware systems.
  • Familiarity with predictive maintenance or anomaly detection approaches applied to fleet health data
  • Customer obsession — genuine passion for understanding fleet reliability challenges at scale and translating them into actionable engineering priorities
  • Strong communication — ability to present statistical reliability findings to both deep technical audiences and executive leadership.
  • Demonstrated success driving cross-functional improvements across hardware, firmware, and software teams without direct authority

Nice To Haves

  • Experience in hardware health or fleet reliability at a leading CSP/hyperscaler
  • Familiarity with NVIDIA GPU error taxonomy (Xid errors, NVLink error counters, thermal events, CPER records)
  • Experience building health scoring or predictive failure models for accelerator or HPC infrastructure
  • Background in defining MTBI/MTBF measurement standards or certification programs for complex multi-component systems
  • Understanding of how reliability data flows from device firmware through telemetry pipelines to fleet-level dashboards and automated remediation

Responsibilities

  • Drive reliability work streams with CSP engineering teams — ensuring shared understanding of MTBI measurement methodology, failure classification, and health monitoring architecture
  • Gather and synthesize CSP fleet reliability data — identify failure patterns that appear across multiple customers and champion improvements back into NVIDIA's firmware, driver, and hardware teams
  • Define consistent MTBI measurement methodology that works across different CSP monitoring environments and operational practices
  • Conduct fleet-scale failure pattern analysis using statistical methods (Pareto, survival analysis, Weibull) to classify failures as systemic, environmental, or configuration-specific
  • Drive fleet health monitoring integration architecture — ensure NVIDIA's health agents, telemetry, and reporting align with CSP operational workflows and automation
  • Define the burn-in reliability test environment and cluster certification criteria in collaboration with quality teams, validating with customers that the criteria are meaningful
  • Collaborate with CSPs to ensure reliability-related integration work (health monitoring deployment, telemetry pipeline, alerting configuration) is complete ahead of at-scale launch
  • Develop predictive failure models using fleet telemetry and validate their effectiveness in customer environments

Benefits

  • Equity
  • Benefits