Sr. Staff Software Engineer (Reliability)

Zscaler•San Jose, CA

Hybrid

About The Position

About Zscaler

Zscaler accelerates digital transformation to ensure our customers can be more agile, efficient, resilient, and secure. As an AI-forward enterprise, we are constantly pushing the envelope, leveraging the world’s largest security data lake to power our cloud-native Zero Trust Exchange platform. This innovation protects our customers from cyberattacks and data loss by securely connecting users, devices, and applications in any location.

Here, impact in your role matters more than title, and trust is built on results. We say: impact over activity. We seek innovators who actively use AI to amplify their impact and who thrive in an environment where we leverage intelligent systems to stay ahead of evolving threats. We believe in transparency and value constructive, honest debate; we’re focused on getting to the best ideas, faster.

We build high-performing teams that can make an impact quickly and with high quality. To do this, we are building a culture of execution centered on customer obsession, collaboration, ownership, and accountability. We value high impact and high accountability, with a sense of urgency, in an environment where you’re enabled to do your best work and embrace your potential. If you’re driven by purpose, thrive on solving complex challenges, and want to be part of the team that’s helping to secure the AI age, we invite you to bring your talents to Zscaler and help shape the future of cybersecurity.

Role

We are looking for a Sr. Staff Software Engineer to join our Service Platform Automation team. This role offers flexibility to work a hybrid schedule (three days a week onsite) in San Jose, CA, reporting to the VP of Engineering. In this high-ownership position, you will build and operate the orchestration and reliability automation that manages ZIA’s fleet lifecycle at massive scale.
You will initially focus on leading the architectural transformation of legacy scripts into a safe, deterministic, Temporal-based orchestration platform to achieve "one-touch" provisioning. As you scale the platform, you will expand the team’s mission into AI SRE practices, applying software engineering to identify and solve systemic inefficiencies and build self-healing capabilities across our global fleet.

Requirements

BS or MS in Computer Science or a related technical field with 10+ years of experience in hyperscale systems, with a deep understanding of the unique failure modes and technical hurdles that only emerge at massive scale
Mastery of backend systems languages (Go, Java, Python, or others) with a proven ability to set the bar for code quality, maintainability, and distributed system correctness
Experience designing and operating complex distributed systems, with a focus on solving systemic challenges in concurrency, failure handling, and performance optimization
Expertise in building automation using REST APIs and Swagger with strong guarantees for idempotency, verification, and safe rollout patterns
Expertise in engineering and operating hybrid infrastructure across cloud platforms (AWS/GCP, GKE) and on-premises environments, ensuring consistent container orchestration and CI/CD safety

Nice To Haves

Experience building or operating AI-enabled developer/ops tooling with measurable improvements in triage speed and operational efficiency
Experience in testing orchestration systems, including determinism verification, fault injection, and chaos engineering
Proficiency in PostgreSQL, including SQL development and schema management, to power high-scale, stateful management-plane services

Responsibilities

Drive the migration from legacy scripts to a Temporal-based platform, engineering replay-safe workflows with built-in retries, idempotency, and safe rollback designs for one-touch fleet operations
Identify and solve systemic inefficiencies across our global fleet, engineering technical solutions needed to make our operations more autonomous
Build systems that leverage LLMs and ML for intelligent triage, global signal correlation, and automated runbooks to eliminate manual toil
Develop framework-style services for feature teams, ensuring all new products are delivered "automation-ready" with reliability hooks built directly into the code
Ensure every fleet-wide action is fully explainable, replayable, and auditable by implementing comprehensive metrics, traces, and event logging