Director, Engineering Operations and Site Reliability Engineering - Datacenter Server Systems

NVIDIA • US, CA

About The Position

NVIDIA is seeking a strong technology leader for Engineering Operations and Site Reliability Engineering on our next-generation datacenter server systems. This role sits at the intersection of execution, reliability, automation, and large-scale system operations, where we keep NVIDIA’s rack-scale systems healthy, observable, and highly available for internal engineering users. These systems bring together the full power of NVIDIA CPUs, GPUs, NVLink, InfiniBand/Spectrum-X networking, cluster management technologies, and our optimized AI/HPC software stack. We enable fast product development by ensuring that large internal racks, clusters, and lab infrastructure are reliable, well-instrumented, and operated with scalable engineering practices. This is a technical leadership role focused on execution excellence for large-scale internal datacenter systems. The ideal candidate has strong engineering judgment, experience operating complex distributed infrastructure, and the ability to build teams that combine focused operations with automation-first software engineering.

Requirements

BS or MS in Computer Science, Electrical Engineering, Computer Engineering, or related field (or equivalent experience).
12+ years of overall experience in infrastructure, systems engineering, reliability, datacenter operations, distributed systems, or related areas, including 7+ years of people management experience.
Strong understanding of server systems, Linux, cluster operations, high-speed networking, and large-scale infrastructure.
Experience operating complex systems with high availability expectations, including monitoring, incident management, automation, and fleet-health practices.
Proven track record of driving execution across multiple teams, priorities, and technical domains, including close partnership with hardware, firmware, software, networking, validation, and infrastructure organizations.
Clear written and verbal communication skills, including executive-level reporting on operational health, risks, and priorities.
Track record of building cohesive teams and developing technical leaders who improve reliability and execution.

Nice To Haves

Prior Director or Senior Manager experience leading infrastructure, reliability, platform engineering, or large-scale lab operations teams.
Experience operating GPU, AI, HPC, cloud, or hyperscale datacenter infrastructure.
Broad knowledge of rack-scale systems, including server management, networking, storage, power, thermal, and RAS concepts.
Experience building automation, telemetry, fleet health, or dashboarding systems that improve product quality, serviceability, or engineering velocity.

Responsibilities

Lead teams that help us ensure NVIDIA’s internal rack-scale server systems, clusters, and lab facilities remain available, healthy, and reliable.
Drive execution across fleet operations, incident response, roadmap planning, change management, operational readiness, and reliability metrics.
Build automation, telemetry, alerting, and dashboards that improve visibility and help teams resolve issues faster.
Partner with hardware, firmware, software, networking, validation, and infrastructure teams to deploy, sustain, and debug complex systems.
Create feedback loops into NPI and sustaining teams to improve product quality, serviceability, and development velocity.
Grow and mentor a high-performing technical team with a culture of ownership, learning, and automation-first execution.