Site Reliability Engineer - Air Platform Team

NVIDIA • Durham, NC

127d•$148,000 - $235,750

About The Position

NVIDIA has been redefining computer graphics, PC gaming, and accelerated computing for more than 25 years. It's an outstanding legacy of innovation driven by extraordinary technology and amazing people. NVIDIA is looking for a highly motivated Site Reliability Engineer to join the team behind NVIDIA AIR, the Digital Twin for Data Center Simulation web application. NVIDIA AIR enables cloud-scale efficiency by creating identical replicas of real-world data center infrastructure deployments.

Requirements

BS degree in Computer Science, Software Engineering, or a related field (or equivalent experience).
5+ years of experience in a Site Reliability, DevOps, or Systems Engineering role.
Strong automation and scripting skills in Ansible, Python, and Shell Scripting.
Experience in IaaS environments, including deploying, configuring, and administering Linux-based bare metal servers.
Deep experience in infrastructure engineering, focused on managing and monitoring a highly available production infrastructure.
Skilled in observability practices, using Prometheus, Grafana, ELK/EFK, and integrated alerting systems.
Solid grasp of Linux internals and core networking concepts including NAT, DNS, DHCP, routing, and firewall configuration with iptables or nftables.
Experience with modern deployment architecture for non-disruptive cloud operations, including blue-green and canary rollouts.
Proficiency in Kubernetes, Docker, QEMU, and Libvirt.

Nice To Haves

Hands-on expertise with AWS, including deploying complex, load-balanced, and highly available workloads.
Proficiency in debugging network issues in both infrastructure and SDN.
Experience with performance tuning and benchmarking across storage, compute, or networking.
Experience implementing robust metrics collection and alerting infrastructure.
Familiarity with compliance standards such as FedRAMP, HIPAA, and SOC 2.

Responsibilities

Design, deploy, and manage IaaS platforms with a focus on high availability and performance.
Automate infrastructure operations using tools like Terraform, Ansible, and Python.
Focus on efficiency by automating repetitive workflows.
Develop monitoring and observability tooling to detect and prevent outages using Prometheus, Grafana, ELK, etc.
Deploy and troubleshoot non-disruptive cloud operations with an emphasis on secure production infrastructure.
Manage deployments and upgrades for operating systems, Kubernetes (k8s) clusters, and other orchestration tools.
Provide day-to-day support for engineering activities with CI/CD tools like Git and Jenkins.
Implement and enforce best practices around infrastructure security, access control, and operational efficiency.

Benefits

Competitive salaries
Generous benefits package
Equity opportunities

What This Job Offers

Job Type

Full-time

Career Level

Mid Level

Industry

Computer and Electronic Product Manufacturing

Education Level

Bachelor's degree
