Staff Site Reliability Engineer

Zscaler•San Jose, CA

12d•$119,000 - $170,000•Hybrid

About The Position

We are looking for a Staff Site Reliability Engineer to join our team. This role will report to the Senior Manager, Site Reliability Engineering and offers the flexibility of hybrid (3 days a week) out of San Jose, CA, or can be performed fully remote. As a key member of the Zero Trust Exchange team, you will be responsible for all aspects of the Zscaler production data center services, including servers, operating systems, storage, and supporting systems. You will be an instrumental part of the Site Reliability Engineering team, ensuring the availability, latency, performance, efficiency, and scalability of a cloud that processes tens of billions of transactions daily.

Requirements

US Citizenship is required (due to the nature of assigned customers) and 5+ years industry experience in software engineering, infrastructure software, and/or platform engineering
Proficiency in at least one programming language (such as Python, Bash, or Go) with demonstrated ability to write production-quality code (testing, code reviews, CI, maintainable design,scripting for diagnostics
Strong Linux/Unix systems fundamentals (process/memory, filesystems, networking stack basics, debugging/perf troubleshooting) and solid understanding of networking protocols and components (e.g., HTTP, DNS, TCP/IP, ICMP, OSI model, subnetting, and load balancing/traffic concepts)
Proven experience operating production services (including incident response, troubleshooting, reducing toil) and ability to participate in on-call rotations and support occasional after-hours or weekend deployments
Managing BSD in production, with a focus on driving systemic fixes through platform engineering

Nice To Haves

Proven expertise in operating Kubernetes at scale
Deep experience with the Prometheus/OpenTelemetry ecosystems, including instrumenting golden signals, defining SLOs, and performing alert tuning to ensure high-availability environments

Responsibilities

Own the reliability of a large-scale cloud service (Linux/BSD, bare metal, Kubernetes, custom load balancing, SD-WAN) by partnering with Engineering and Network teams to define requirements early, conduct operability reviews, and contribute code/design docs for platform resilience
Develop and operate end-to-end observability (metrics/logs/traces, dashboards, alerting) and incident tooling to manage SLOs/error budgets, reduce noise, and improve system detection and diagnosis
Participate in an on-call rotation to lead full-cycle incident response; perform deep cross-stack troubleshooting (OS, networking, distributed systems, packet captures, core dumps) to drive permanent software fixes and codify learnings into runbooks and tests
Build and maintain everything-as-code for fleet and service lifecycle, driving provisioning, configuration, release automation, canary deployments, and complex rollout/rollback workflows
Continuously improve platform hygiene through consistent OS/app upgrades, dependency/vulnerability patching, capacity and performance tuning, and strict CI/CD validation prior to production rollouts