AI and Systems Software Intern, At Scale AI - Fall 2026

NVIDIA•Santa Clara, CA

7d•$20 - $71

About The Position

Our work at NVIDIA is dedicated to a computing model focused on visual and AI computing. For two decades, NVIDIA has pioneered visual computing, the art and science of computer graphics, with our invention of the GPU. The GPU has also proven to be spectacularly effective at solving some of the most complex problems in computer science. Today, NVIDIA’s GPU simulates human intelligence, running deep learning algorithms and acting as the brain of computers, robots, and self-driving cars that can perceive and understand the world. We are looking to grow our company and teams with the smartest people in the world, and there has never been a more exciting time to join our team!

NVIDIA is looking for an intern for an exciting role in AI and Systems Software for datacenter applications. You will be deeply involved in system-level debugging, analyzing the reliability of our large-scale infrastructure, and correlating complex failure modes with underlying hardware or system issues. We work with the latest Accelerated Computing and Deep Learning software and hardware platforms, along with many scientific researchers, developers, and customers, to craft improved workflows and develop new, leading, differentiated solutions. Our team works with OS, container technology, GPU compute, and systems specialists to architect, develop, and bring up large-scale performance software components and optimize performance.

Requirements

Pursuing a BS, MS, or PhD in Computer Science, Computer Engineering, Electrical Engineering, or a related field.
Proficiency in Python and Bash/Shell scripting for automation and tool development.
Proven debugging skills with an ability to isolate issues in complex, distributed systems.
Exposure to high-performance computing (HPC) environments, cluster managers (e.g., Slurm, Kubernetes), or large-scale distributed systems.

Nice To Haves

Familiarity with server architecture (PCIe, NVLink, CPU/GPU interactions) and hardware diagnostics.
Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
Familiarity with system profiling and debugging tools (e.g., strace, gdb, perf).
Experience running and analyzing standard industry benchmarks on Linux systems.
Excellent collaboration and communication skills, with a desire to learn and be part of a committed, hardworking team.
Ability to multitask effectively in a dynamic, high-performance environment.

Responsibilities

Investigate and triage failures within large-scale compute clusters, performing deep-dive analysis to distinguish between software glitches, configuration errors, and hardware faults.
Analyze logs and telemetry to correlate specific job failures to system-level issues and diagnostic test failures, helping to reduce noise and identify root causes.
Assist with tracking, calculating, and reporting key reliability metrics, specifically Mean Time Between Failures (MTBF) and Mean Time Between Interruptions (MTBI), to drive infrastructure improvements.
Assist in analyzing large-scale workload issues and identifying application and infrastructure improvements so that jobs run as fast and as reliably as possible.
Work closely with a mentor to learn about hardware validation suite architecture, document debugging methodologies, and help the team make intelligent, data-backed engineering decisions.