Sr. Staff Software Engineer (Reliability)

Zscaler | San Jose, CA
$176,000 - $220,000 | Hybrid

About The Position

We are looking for a Sr. Staff Software Engineer to join our Service Platform Automation team. This role offers flexibility to work a hybrid schedule (three days a week onsite) in San Jose, CA, reporting to the VP of Engineering. In this high-ownership position, you will build and operate the orchestration and reliability automation that manages ZIA’s fleet lifecycle at massive scale. You will initially focus on leading the architectural transformation of legacy scripts into a safe, deterministic, Temporal-based orchestration platform to achieve "one-touch" provisioning. As you scale the platform, you will expand the team’s mission into AI SRE practices, applying software engineering to identify and solve systemic inefficiencies and build self-healing capabilities across our global fleet.

Requirements

  • BS or MS in Computer Science or a related technical field with 10+ years of experience in hyperscale systems, with a deep understanding of the unique failure modes and technical hurdles that only emerge at massive scale
  • Mastery of backend systems languages (Go, Java, Python, or others) with a proven ability to set the bar for code quality, maintainability, and distributed system correctness
  • Experience designing and operating complex distributed systems, with a focus on solving systemic challenges in concurrency, failure handling, and performance optimization
  • Expertise in building automation using REST APIs and Swagger with strong guarantees for idempotency, verification, and safe rollout patterns
  • Expertise in engineering and operating hybrid infrastructure across cloud platforms (AWS/GCP, GKE) and on-premises environments, ensuring consistent container orchestration and CI/CD safety

Nice To Haves

  • Experience building or operating AI-enabled developer/ops tooling with measurable improvements in triage speed and operational efficiency
  • Experience in testing orchestration systems, including determinism verification, fault injection, and chaos engineering
  • Proficiency in PostgreSQL, including SQL development and schema management, to power high-scale, stateful management-plane services

Responsibilities

  • Drive the migration from legacy scripts to a Temporal-based platform, engineering replay-safe workflows with built-in retries, idempotency, and safe rollback designs for one-touch fleet operations
  • Identify and solve systemic inefficiencies across our global fleet, engineering technical solutions needed to make our operations more autonomous
  • Build systems that leverage LLMs and ML for intelligent triage, global signal correlation, and automated runbooks to eliminate manual toil
  • Develop framework-type services for feature teams, ensuring all new products are delivered "automation-ready" with reliability hooks built directly into the code
  • Ensure every fleet-wide action is fully explainable, replayable, and auditable by implementing comprehensive metrics, traces, and event logging

Benefits

  • Various health plans
  • Time off plans for vacation and sick time
  • Parental leave options
  • Retirement options
  • Education reimbursement
  • In-office perks