Sr. Staff Production Engineer

Zscaler • San Jose, CA

1d • Hybrid

About The Position

Zscaler accelerates digital transformation so that customers can be more agile, efficient, resilient, and secure. As an AI-forward enterprise, Zscaler leverages the world’s largest security data lake to power its cloud-native Zero Trust Exchange platform, protecting customers from cyberattacks and data loss by securely connecting users, devices, and applications anywhere. The company values impact over activity and seeks innovators who use AI to amplify their work and thrive in an environment that leverages intelligent systems. Zscaler fosters transparency and constructive debate, and builds high-performing teams focused on customer obsession, collaboration, ownership, and accountability. It values high impact and high accountability, paired with a sense of urgency, enabling employees to do their best work and embrace their potential.

The Sr. Staff Production Engineer role is offered as either a hybrid position (3 days a week in San Jose, CA) or a remote position, reporting to Production Engineering in the Cloud Infrastructure & Operations department. The role is crucial for enhancing the reliability of a global platform protecting over 15 million users, and it provides the technical vision and hands-on execution needed to drive an "automation-first" culture. By maturing observability and architectural standards, the engineer will directly reduce Mean Time to Mitigate (MTTM) and shape the scalability of Zscaler's globally distributed, multi-cloud infrastructure.

Requirements

8+ years of experience managing reliability, scalability, and availability for large-scale production services
Deep expertise in programming (e.g., Python, Go, or C/C++)
Strong background in networking protocols, Linux/FreeBSD systems, and distributed architecture
Experience in high-stakes incident management and participation in a 24/7 on-call rotation
Proficiency in leveraging ITIL frameworks and incident data to drive service maturity through systematic problem management and technical operability reviews

Nice To Haves

Extensive experience with public cloud (AWS, Azure, GCP) and Infrastructure-as-Code (Ansible, Terraform)
Experience with chaos engineering and disaster recovery planning at scale
Expertise in global routing (BGP) and traffic tunneling (GRE, IPSec) with a deep understanding of L7 proxy architectures (HAProxy), DNS at scale, and OS networking stack internals

Responsibilities

Design and implement highly available, scalable infrastructure across AWS, Azure, GCP, and bare-metal environments
Drive an "automation-first" culture by writing code (Python/Go) to eliminate manual toil and build self-healing systems
Implement and maintain sophisticated observability (Prometheus, Grafana, OpenTelemetry), define SLIs/SLOs, and establish error budgets
Act as a lead Incident Commander (TDO on-call), develop response playbooks, and conduct deep-dive post-incident analyses
Collaborate with Engineering and partner teams to conduct operability reviews