NVIDIA sits at the center of the AI revolution, and the teams behind our data and observability platforms keep the whole engine running! We’re hiring Site Reliability Engineers who want to work on the systems that power everything from large-scale data pipelines to model training clusters to real-time decision-making. This isn’t a typical SRE role: you’ll help design and run NVIDIA’s global telemetry backbone, the platform that carries metrics, logs, traces, and profiling data for some of the most demanding workloads in the world. You’ll shape how our AI and data systems are built, set reliability standards, and solve the scaling challenges that come with operating at NVIDIA’s pace and scale. If you enjoy diving into distributed systems, building automation that eliminates toil, and partnering with infrastructure and application teams to raise the reliability bar, this is a place where your work will have real impact. You’ll also be joining a group that values curiosity, learning, and blameless engineering, giving you room to grow while working on problems that matter.
Job Type
Full-time
Career Level
Mid Level
Number of Employees
5,001-10,000 employees