Principal Site Reliability Engineer

Zscaler•San Jose, CA

$164,500 - $235,000•Hybrid

About The Position

We are looking for a Principal Site Reliability Engineer to join our team. This role is available either as a hybrid position, 3 days a week in San Jose, CA, or as a remote position, and reports to Production Engineering in the Cloud Infrastructure & Operations department. Join Zscaler to be a force multiplier for the reliability of a global platform protecting over 15 million users. In this role, you will provide the technical vision and hands-on execution to drive an "automation-first" culture across the company. By maturing our observability and architectural standards, you will directly reduce our Mean Time to Mitigate (MTTM) and shape the scalability of our globally distributed, multi-cloud infrastructure.

Requirements

10+ years of experience managing reliability, scalability, and availability for large-scale production services
Deep expertise in programming (e.g., Python, Go, or C/C++)
Strong background in networking protocols, Linux/FreeBSD systems, and distributed architecture
Experience in high-stakes incident management and participation in a 24/7 on-call rotation
Proficiency in leveraging ITIL frameworks and incident data to drive service maturity through systematic problem management and technical operability reviews

Nice To Haves

Extensive experience with public cloud (AWS, Azure, GCP) and Infrastructure-as-Code (Ansible, Terraform)
Experience with chaos engineering and disaster recovery planning at scale
Expertise in global routing (BGP) and traffic tunneling (GRE, IPsec) with a deep understanding of L7 proxy architectures (HAProxy), DNS at scale, and OS networking stack internals

Responsibilities

Design and implement highly available, scalable infrastructure across AWS, Azure, GCP, and bare-metal environments
Drive an "automation-first" culture by writing code (Python/Go) to eliminate manual toil and build self-healing systems
Implement and maintain sophisticated observability (Prometheus, Grafana, OpenTelemetry), define SLIs/SLOs, and establish error budgets
Act as a lead Incident Commander (TDO on-call), develop response playbooks, and conduct deep-dive post-incident analyses
Work with Engineering and partner teams to conduct operability reviews