Infrastructure Systems Engineer

NVIDIA•Santa Clara, CA

1d•$124,000 - $195,500•Hybrid

About The Position

NVIDIA’s Kernel Infrastructure team is looking for a hands-on Systems Engineer to manage the environment readiness, configuration, and long-term health of our next-generation GPU platforms. You will own the key lifecycle phase where early production hardware meets software. Your role ensures our innovative systems are stable, optimized, and continuously maintained for engineering teams. If you love being hands-on with early-stage computing platforms, debugging complex hardware-to-software environments, and owning the operational stability of fast-evolving infrastructure, join us in Santa Clara, CA.

Requirements

Degree in Computer Engineering, Electrical Engineering, Computer Science, or equivalent experience.
3+ years in systems engineering, infrastructure operations, or hardware validation environments that handle early-stage platforms.
Deep Linux and Windows system administration experience, with strong debugging skills across the hardware-to-software stack.
Proficiency in scripting and automation (shell scripting, Python, Ansible, etc.).
Hands-on experience with Slurm, Kubernetes, or other cluster management platforms.
Strong, clear written and verbal communication skills, including the ability to explain complex technical concepts to non-technical audiences.
Strong problem-solving skills and a collaborative approach.
Self-motivated individual and a great teammate.

Nice To Haves

Experience managing HPC clusters at scale.
A proven track record of configuring and maintaining bring-up systems and early hardware prototypes.
Demonstrated technical curiosity and a drive to innovate.
Mechanically inclined and comfortable with tools and hands-on physical work.
Positive and cooperative, with the determination to help us reach the finish line.

Responsibilities

Drive early-stage engineering systems to a performance-ready state. Handle firmware/VBIOS flashing, core clock configurations, power-state enablement, and system tuning.
Act as the first line of defense for complex system and environment-level issues, coordinating directly with firmware, hardware design, and platform teams to unblock engineering.
Monitor and optimize the ongoing health of the hardware fleet. Implement proactive health checks, diagnose degrading systems, and provide manual recovery when automated workflows fall short.
Establish and detail the "golden" system baselines (drivers, firmware, configurations) required for stable engineering execution as the product evolves.
Track hardware inventory and manage demands from engineering teams to improve hardware utilization.