Sr. Staff Production Engineer

Zscaler
San Jose, CA
Hybrid

About The Position

Zscaler accelerates digital transformation so that customers can be more agile, efficient, resilient, and secure. As an AI-forward enterprise, Zscaler leverages the world’s largest security data lake to power its cloud-native Zero Trust Exchange platform, protecting customers from cyberattacks and data loss by securely connecting users, devices, and applications anywhere.

The company values impact over activity, seeking innovators who use AI to amplify their work and thrive in an environment that leverages intelligent systems. Zscaler fosters transparency and constructive debate, and builds high-performing teams focused on customer obsession, collaboration, ownership, and accountability. It values high-impact, high-accountability work carried out with a sense of urgency, enabling employees to do their best work and embrace their potential.

The Sr. Staff Production Engineer role is offered as a hybrid position (3 days a week in San Jose, CA) or as a remote position, and reports to Production Engineering in the Cloud Infrastructure & Operations department. The role is crucial to enhancing the reliability of a global platform that protects over 15 million users, and it provides the technical vision and hands-on execution needed to drive an "automation-first" culture. By maturing observability and architectural standards, the engineer will directly reduce Mean Time to Mitigate (MTTM) and shape the scalability of Zscaler's globally distributed, multi-cloud infrastructure.

Requirements

  • 8+ years of experience managing reliability, scalability, and availability for large-scale production services
  • Deep expertise in programming (e.g., Python, Go, or C/C++)
  • Strong background in networking protocols, Linux/FreeBSD systems, and distributed architecture
  • Experience in high-stakes incident management and participation in a 24/7 on-call rotation
  • Proficiency in leveraging ITIL frameworks and incident data to drive service maturity through systematic problem management and technical operability reviews

Nice To Haves

  • Extensive experience with public cloud (AWS, Azure, GCP) and Infrastructure-as-Code (Ansible, Terraform)
  • Experience with chaos engineering and disaster recovery planning at scale
  • Expertise in global routing (BGP) and traffic tunneling (GRE, IPsec), with a deep understanding of L7 proxy architectures (HAProxy), DNS at scale, and OS networking stack internals

Responsibilities

  • Design and implement highly available, scalable infrastructure across AWS, Azure, GCP, and bare-metal environments
  • Drive an "automation-first" culture by writing code (Python/Go) to eliminate manual toil and build self-healing systems
  • Implement and maintain sophisticated observability (Prometheus, Grafana, OpenTelemetry), define SLIs/SLOs, and establish error budgets
  • Act as a lead Incident Commander (TDO on-call), develop response playbooks, and conduct deep-dive post-incident analyses
  • Collaborate with Engineering and partner teams to conduct operability reviews

Benefits

  • Various health plans
  • Time off plans for vacation and sick time
  • Parental leave options
  • Retirement options
  • Education reimbursement
  • In-office perks

What This Job Offers

Job Type

Full-time

Career Level

Principal

Education Level

No Education Listed

Number of Employees

1,001-5,000 employees
