NVIDIA is seeking a strong technology leader for our Engineering Operations and Site Reliability Engineering for our next-generation datacenter server systems. This role sits at the intersection of execution, reliability, automation, and large-scale system operations, where we keep NVIDIA’s rack-scale systems healthy, observable, and highly available for internal engineering users. These systems bring together the full power of NVIDIA CPUs, GPUs, NVLink, InfiniBand/Spectrum-X networking, cluster management technologies, and our optimized AI/HPC software stack. We enable fast product development by ensuring large internal racks, clusters, and lab infrastructure are reliable, well-instrumented, and operated with scalable engineering practices. This is a technical leadership role focused on execution excellence for large-scale internal datacenter systems. The ideal candidate has strong engineering judgment, experience operating complex distributed infrastructure, and the ability to build teams that combine focused operations with automation-first software engineering.
Stand Out From the Crowd
Upload your resume and get instant feedback on how well it matches this job.
Job Type
Full-time
Career Level
Director