About The Position

NVIDIA is seeking a strong technology leader to head Engineering Operations and Site Reliability Engineering for our next-generation datacenter server systems. This role sits at the intersection of execution, reliability, automation, and large-scale system operations, where we keep NVIDIA’s rack-scale systems healthy, observable, and highly available for internal engineering users. These systems bring together the full power of NVIDIA CPUs, GPUs, NVLink, InfiniBand/Spectrum-X networking, cluster management technologies, and our optimized AI/HPC software stack. We enable fast product development by ensuring that large internal racks, clusters, and lab infrastructure are reliable, well-instrumented, and operated with scalable engineering practices. This is a technical leadership role focused on execution excellence for large-scale internal datacenter systems. The ideal candidate has strong engineering judgment, experience operating complex distributed infrastructure, and the ability to build teams that combine focused operations with automation-first software engineering.

Requirements

  • BS or MS in Computer Science, Electrical Engineering, Computer Engineering, or related field (or equivalent experience).
  • 12+ years of overall experience in infrastructure, systems engineering, reliability, datacenter operations, distributed systems, or related areas, including 7+ years of people management experience.
  • Strong understanding of server systems, Linux, cluster operations, high-speed networking, and large-scale infrastructure.
  • Experience operating complex systems with high availability expectations, including monitoring, incident management, automation, and fleet-health practices.
  • Proven track record of driving execution across multiple teams, priorities, and technical domains, including close partnership with hardware, firmware, software, networking, validation, and infrastructure organizations.
  • Clear written and verbal communication skills, including executive-level reporting on operational health, risks, and priorities.
  • Track record of building cohesive teams and developing technical leaders who improve reliability and execution.

Nice To Haves

  • Prior Director or Senior Manager experience leading infrastructure, reliability, platform engineering, or large-scale lab operations teams.
  • Experience operating GPU, AI, HPC, cloud, or hyperscale datacenter infrastructure.
  • Broad knowledge of rack-scale systems, including server management, networking, storage, power, thermal, and RAS concepts.
  • Experience building automation, telemetry, fleet health, or dashboarding systems that improve product quality, serviceability, or engineering velocity.

Responsibilities

  • Lead teams that help us ensure NVIDIA’s internal rack-scale server systems, clusters, and lab facilities remain available, healthy, and reliable.
  • Drive execution across fleet operations, incident response, roadmap planning, change management, operational readiness, and reliability metrics.
  • Build automation, telemetry, alerting, and dashboards that improve visibility and help teams resolve issues faster.
  • Partner with hardware, firmware, software, networking, validation, and infrastructure teams to deploy, sustain, and debug complex systems.
  • Create feedback loops into NPI and sustaining teams to improve product quality, serviceability, and development velocity.
  • Grow and mentor a high-performing technical team with a culture of ownership, learning, and automation-first execution.

Benefits

  • equity
  • benefits