About The Position

We're looking for a Principal Software Engineer to join our CSP Engagements team as the technical focal point for rack-scale system SW/FW, working with CSP engineering teams to ensure they can deploy, monitor, and operate these systems reliably at fleet scale. In this role, you will collaborate with NVIDIA's cross-functional rack-scale system SW/FW engineering teams, providing dedicated CSP-facing technical leadership. Your focus is on the system-level software that manages, monitors, and recovers the rack as a whole: fabric management, GPU/NVSwitch error handling and recovery, health telemetry APIs, firmware update orchestration, and SW-driven serviceability. You will drive work streams with CSP engineering teams to build shared understanding of the architecture, incorporate their operational feedback, and ensure integration readiness.

Requirements

  • 15+ years of experience in system software, platform firmware, or large-scale distributed systems engineering
  • BS or MS in Computer Science, Electrical Engineering, or related field (or equivalent experience)
  • Deep understanding of rack-scale system software challenges: multi-component coordination, error propagation, health monitoring, and serviceability / reliability
  • Experience with fabric management software, cluster management, or system-level orchestration frameworks
  • Familiarity with firmware architectures and update lifecycle management (multi-component update sequencing, rollback, recovery)
  • Understanding of error handling and recovery design patterns in distributed systems — fault isolation, retry policies, graceful degradation
  • Experience with health monitoring and telemetry systems: health scoring, event correlation, API design for fleet-level observability
  • Understanding of GPU or accelerator system software (drivers, device management, power management) is a strong plus
  • Customer obsession — genuine passion for understanding how CSPs operate sophisticated systems at fleet scale and simplifying their experience
  • Proven success providing technical leadership across organizational boundaries and influencing system software design without direct authority
  • Strong communication: ability to translate complex system software architecture into actionable guidance for customer engineering teams

Nice To Haves

  • Experience with NVIDIA NVSwitch, NVOS, or GPU fabric management software
  • Background in system software for large-scale clusters at a hyperscaler (cluster management, fleet orchestration, health platforms)
  • Experience crafting error handling and recovery frameworks for multi-component systems (hundreds or thousands of coordinating devices)
  • Familiarity with GPU or accelerator fleet operations — driver lifecycle, firmware rollout strategies, health-based scheduling
  • Understanding of how system software decisions impact serviceability, availability, and operational cost at fleet scale

Responsibilities

  • Drive rack-scale SW/FW architecture alignment across CSP engagements — including fabric management software, link health monitoring, GPU/NVSwitch error handling, SW/FW serviceability features (e.g., hot-plug support, component isolation, firmware-driven recovery), and multi-component firmware orchestration
  • Drive technical work streams with CSP engineering teams on rack-scale system software, ensuring they deeply understand fabric management, NVSwitch behavior, error handling and recovery policies, health telemetry APIs, and SW/FW-controlled recovery operations
  • Capture and synthesize CSP engineering feedback on rack-scale system software (health monitoring APIs, SW-driven serviceability workflows, firmware update orchestration, and error recovery behavior) and champion that feedback into NVIDIA's architecture decisions
  • Collaborate with multi-functional teams to ensure customer operational requirements are reflected in system software and firmware development
  • Identify cross-CSP patterns in rack-scale SW/FW issues, error handling behavior, and system configuration practices, and drive documentation, tooling, and test strategy improvements as a result
  • Collaborate with execution teams on shift-left strategy, ensuring customer-side SW/FW integration work is identified early and completed ahead of hardware availability
  • Make critical technical decisions on rack-scale system SW/FW tradeoffs and mitigate execution risks through early engagement with CSP engineering teams

Benefits

  • equity
  • benefits