Senior Director, Site Reliability Engineering

Zscaler•San Jose, CA

$231,000 - $330,000•Hybrid

About The Position

Zscaler accelerates digital transformation to ensure our customers can be more agile, efficient, resilient, and secure. As an AI-forward enterprise, we are constantly pushing the envelope, leveraging the world’s largest security data lake to power our cloud-native Zero Trust Exchange platform. This innovation protects our customers from cyberattacks and data loss by securely connecting users, devices, and applications in any location.

Here, impact in your role matters more than title, and trust is built on results. We say: impact over activity. We seek innovators who actively use AI to amplify their impact and who thrive in an environment where we leverage intelligent systems to stay ahead of evolving threats. We believe in transparency and value constructive, honest debate; we’re focused on getting to the best ideas, faster. We build high-performing teams that can make an impact quickly and with high quality. To do this, we are building a culture of execution centered on customer obsession, collaboration, ownership, and accountability. We value high impact and high accountability, with a sense of urgency, in an environment where you’re enabled to do your best work and embrace your potential.

If you’re driven by purpose, thrive on solving complex challenges, and want to be part of the team that’s helping to secure the AI age, we invite you to bring your talents to Zscaler and help shape the future of cybersecurity.

We are looking for a Senior Director, Production Engineering to join our team. This role is available as a hybrid opportunity 3 days a week in San Jose, CA or as a remote position, reporting to the VP, Engineering in the Cloud Infrastructure & Operations department. Join Zscaler to set the strategic direction and lead the organization responsible for the reliability and operational excellence of our global platform protecting over 15 million users.
In this role, you will define the long-term technical vision and operational strategy, leading high-priority investments to drive an "automation-first" culture and architect reliability into our next generation of products. You will mature observability standards and define company-wide SLIs/SLOs, acting as the executive owner for programs aligned to achieve availability goals and ensuring the scalability and resilience of our globally distributed, multi-cloud infrastructure.

Requirements

18+ years of relevant experience, including 10+ years leading large-scale engineering or SRE organizations delivering mission-critical, production-grade systems.
Deep technical mastery and strategic understanding of distributed architecture, high-scale networking protocols, Linux systems, and multi-cloud environments (AWS, Azure, GCP).
Proven experience setting and executing an operational strategy across multiple departments, with a track record of significantly improving platform availability, performance, and MTTM.
Exceptional executive-level cross-functional leadership and communication skills, with demonstrated ability to influence product roadmaps and engineering culture across a global organization.
Strong production ownership experience: defining and meeting SLOs/SLIs, driving continuous reliability improvements, and managing high-stakes incident response programs.

Nice To Haves

Cloud Migration & Infrastructure-as-Code: Proven track record of successfully migrating large, complex systems to cloud-native architectures while leveraging IaC (Ansible, Terraform) at enterprise scale to manage global infrastructure.
Global Networking & L7 Proxy Architectures: Expertise in global routing (BGP, OSPF) and high-performance L7 proxy architectures (HAProxy, Envoy) within high-volume, multi-tenant environments.
Resilience & Disaster Recovery: Deep experience with large-scale chaos engineering, resilience testing, and comprehensive disaster recovery planning.

Responsibilities

Define the multi-year technical strategy and roadmap for Production Engineering, focusing on platform architecture, automation, and operational standards across AWS, Azure, GCP, and bare-metal environments.
Lead, mentor, and grow a high-performing SRE organization, including hiring and developing leaders (e.g., Directors and Principal SREs), and championing a culture of execution, accountability, and continuous improvement.
Drive the "AI-first" (and "automation-first") mandate at an organizational level, sponsoring large-scale engineering initiatives to eliminate systemic toil and build advanced self-healing capabilities.
Establish and enforce enterprise-wide standards for observability (Prometheus, Grafana, OpenTelemetry), SLIs/SLOs, and error budget discipline across all engineering teams.
Serve as the executive owner for Service Health Reviews, partnering closely with Product, Security, and Engineering leadership to align priorities, manage critical dependencies, and drive company-wide maturity in post-incident analysis, systematic problem management, and the reliability of mission-critical customer-facing systems.