Sr. Software Development Engineer - Orchestration Platform, Temporal, Fleet Management (Flexibility on level)

Zscaler•San Jose, CA


About The Position

We are looking for a Software Engineer (Reliability) to join our team in San Jose, CA, reporting to the Vice President of Engineering. This is a hybrid role, with three days a week onsite, within the Service Platform Automation department. You will build and operate the orchestration and reliability automation that manages ZIA’s fleet lifecycle at massive scale. This is a high-ownership role: you will design and implement orchestration workflows and the supporting services needed for safe, deterministic, idempotent fleet operations, while helping the team evolve toward AI-first execution and operations.

Requirements

BS/MS in Computer Science or a related technical field with 5+ years of experience building and operating production-grade software systems
Strong proficiency in backend/systems languages (Go, Java, C++, or Rust) with the ability to write high-quality, maintainable code
Deep experience designing and operating distributed systems, including concurrency, failure handling, performance optimization, and data modeling
Proven track record of building automation using REST APIs and Swagger with strong guarantees for idempotency, verification, and safe rollout patterns
Hands-on experience with cloud platforms (AWS/GCP, GKE, Cloud SQL, etc.) and proficiency in containerization and CI/CD workflows using Docker and GitLab

Nice To Haves

Experience with Temporal (or similar platforms) to architect large-scale fleet systems for patching, upgrades, and remediation using deterministic, health-gated workflows and replay-safe designs
Testing discipline for orchestration and state machines, including E2E harnesses, determinism verification, fault injection, and chaos engineering to ensure system reliability
Proficiency in PostgreSQL, including SQL development and schema management, to power high-scale, stateful management-plane services and workflows

Responsibilities

Replace legacy Python/Ansible with a centralized, deterministic orchestration platform, refactoring automation into modular, well-defined workflows while retiring external dependencies and nested logic
Engineer execution patterns with retries, idempotency, rate limits/backpressure, and safe rollback/compensation designs aligned to global fleet capacity
Implement safe rollouts using segmentation, canaries, and automated health checks to limit blast radius during fleet-wide upgrades and remediation
Add strong observability and auditability (metrics, traces, replayable histories), participate in the on-call rotation, and drive software-based fixes to reduce toil following post-incident reviews
Integrate AI/LLM capabilities to accelerate legacy code migration and enhance safe operational outcomes through intelligent triage, correlation, and automated runbook generation