Senior SRE Engineer

NVIDIA•Santa Clara, CA

$148,000 - $276,000

About The Position

NVIDIA is seeking a passionate, motivated, and technical engineer to join its multifaceted, fast-paced Infrastructure, Planning, and Processes organization as a Senior SRE Engineer. You will own and scale our internal CI-as-a-Service platform, which includes the shared GitLab CI and GitHub Actions infrastructure used daily by thousands of engineers. You will manage this platform like a product: highly available, self-service, observable, and elastic enough to handle build and test workloads across the company. The team develops and maintains complex build and test environments that support various hardware platforms, including NVIDIA GPUs and Tegra processors, as well as multiple operating systems such as Windows, Linux, and Android. The team collaborates with other NVIDIA software units, such as Graphics Processors, Mobile Processors, Deep Learning, Artificial Intelligence, Robotics, and Autonomous Cars, to meet their infrastructure and system needs.

Requirements

5+ years in SRE/platform roles with strong fundamentals: SLO/SLI definition, incident command, capacity planning, performance tuning, and production Linux administration at scale.
Deep Kubernetes administration experience (CRDs and operators, HPA/VPA/cluster autoscaling, ingress, service mesh, RBAC, network policies, storage classes) and strong troubleshooting skills.
Hands-on expertise with GitLab CI and GitHub Actions at scale: runner architecture, executor tuning, self-hosted runner controllers (Actions Runner Controller (ARC), GitLab Runner Helm chart), cache and artifact strategy, and pipeline development involving DAGs, or equivalent experience.
Strong scripting and automation skills in Python, Go, Bash, or equivalent.
Production experience with IaC and configuration management tools like Terraform, Helm, and Ansible.
Experience with GitOps tools such as Argo CD and Flux is also required.
BS/MS in CS or equivalent experience.
Experience building observability with tools like Prometheus, Grafana, Loki/ELK, OpenTelemetry, or similar products.
You have shipped platforms that other specialists enjoy using.

Nice To Haves

Strong understanding of containerization and microservices architecture.
Certified Kubernetes Administrator (CKA), Certified Kubernetes Security Specialist (CKS), and Certified Kubernetes Application Developer (CKAD) certifications preferred.
Experience building or extending the CI control plane itself: custom runner schedulers, autoscaling, webhook routers, or pipeline orchestration on top of GitLab/GitHub APIs.
Thrives in a multi-tasking environment with continuously evolving priorities.
Ability to break complex problems into simple subproblems and reuse existing solutions for most of them.
Ability to build simple systems that run efficiently without needing much support.
Prior experience on a large-scale operations team.
Experience operating and improving data centers.
Background in algorithms and the ability to choose appropriate ones to meet scaling challenges.

Responsibilities

Build, operate, and scale a multi-tenant CI platform built on GitLab CI and GitHub Actions, encompassing runner fleets, shared caches, artifact storage, and secrets brokering.
Own the underlying Kubernetes substrate end-to-end: cluster lifecycle, upgrades, and autoscaling; node pools for GPU, CPU, and ARM workloads; network and storage policy; and the controllers and operators that schedule runner pods on demand.
Drive reliability and capacity engineering: SLOs and error budgets for queue time, job success, and runner availability; on-call, incident response, postmortems, and structural fixes that keep toil flat as usage grows.
Build the self-service layer (pipeline templates, reusable workflows, golden images, policy-as-code, and guardrails) so product teams onboard in hours, not weeks, with secure-by-default pipelines.
Improve developer experience continuously: faster cold starts, smarter caching, hermetic builds, test sharding and flakiness reduction, and deep observability into pipeline performance and cost per team.