Sr. Staff Software Engineer (Reliability)

Zscaler•San Jose, CA

$176,000 - $220,000•Hybrid

About The Position

We are looking for a Sr. Staff Software Engineer to join our Service Platform Automation team. This role offers flexibility to work a hybrid schedule (three days a week onsite) in San Jose, CA, reporting to the VP of Engineering. In this high-ownership position, you will build and operate the orchestration and reliability automation that manages ZIA’s fleet lifecycle at massive scale. You will initially focus on leading the architectural transformation of legacy scripts into a safe, deterministic, Temporal-based orchestration platform to achieve "one-touch" provisioning. As you scale the platform, you will expand the team’s mission into AI SRE practices, applying software engineering to identify and solve systemic inefficiencies and build self-healing capabilities across our global fleet.

Requirements

BS or MS in Computer Science or a related technical field with 10+ years of experience in hyperscale systems, with a deep understanding of the unique failure modes and technical hurdles that only emerge at massive scale
Mastery of backend systems languages (Go, Java, Python, or others) with a proven ability to set the bar for code quality, maintainability, and distributed system correctness
Experience designing and operating complex distributed systems, with a focus on solving systemic challenges in concurrency, failure handling, and performance optimization
Expertise in building automation using REST APIs and Swagger with strong guarantees for idempotency, verification, and safe rollout patterns
Expertise in engineering and operating hybrid infrastructure across cloud platforms (AWS/GCP, GKE) and on-premises environments, ensuring consistent container orchestration and CI/CD safety

Nice To Haves

Experience building or operating AI-enabled developer/ops tooling with measurable improvements in triage speed and operational efficiency
Experience in testing orchestration systems, including determinism verification, fault injection, and chaos engineering
Proficiency in PostgreSQL, including SQL development and schema management, to power high-scale, stateful management-plane services

Responsibilities

Drive the migration from legacy scripts to a Temporal-based platform, engineering replay-safe workflows with built-in retries, idempotency, and safe rollback designs for one-touch fleet operations
Identify and solve systemic inefficiencies across our global fleet, engineering the technical solutions needed to make our operations more autonomous
Build systems that leverage LLMs and ML for intelligent triage, global signal correlation, and automated runbooks to eliminate manual toil
Develop framework-type services for feature teams, ensuring all new products are delivered "automation-ready" with reliability hooks built directly into the code
Ensure every fleet-wide action is fully explainable, replayable, and auditable by implementing comprehensive metrics, traces, and event logging