Sr. Software Development Engineer - Control Plane, Reliability, Backend (Flexibility on level)

Zscaler•San Jose, CA

$112,000 - $160,000•Hybrid

About The Position

Zscaler accelerates digital transformation to ensure our customers can be more agile, efficient, resilient, and secure. As an AI-forward enterprise, we are constantly pushing the envelope, leveraging the world’s largest security data lake to power our cloud-native Zero Trust Exchange platform. This innovation protects our customers from cyberattacks and data loss by securely connecting users, devices, and applications in any location.

Here, impact in your role matters more than title, and trust is built on results. We say: impact over activity. We seek innovators who actively use AI to amplify their impact and who thrive in an environment where we leverage intelligent systems to stay ahead of evolving threats. We believe in transparency and value constructive, honest debate, because we’re focused on getting to the best ideas, faster. We build high-performing teams that can make an impact quickly and with high quality. To do this, we are building a culture of execution centered on customer obsession, collaboration, ownership, and accountability. We value high impact and high accountability, with a sense of urgency, in an environment where you’re enabled to do your best work and embrace your potential. If you’re driven by purpose, thrive on solving complex challenges, and want to be part of the team that’s helping to secure the AI age, we invite you to bring your talents to Zscaler and help shape the future of cybersecurity.

We are looking for a Software Engineer (Reliability) to join our team in San Jose, CA, reporting to the Vice President of Engineering. This is a hybrid role, with three days a week onsite, within the Service Platform Automation department. You will build and operate the orchestration and reliability automation that manages ZIA’s fleet lifecycle at massive scale.

This is a high-ownership role: you will design and implement orchestration workflows and the supporting services needed for safe, deterministic, idempotent fleet operations, while helping the team evolve toward AI-first execution and operations.

Requirements

BS/MS in Computer Science or a related technical field with 5+ years of experience building and operating production-grade software systems
Strong proficiency in backend/systems languages (Go, Java, C++, or Rust) with the ability to write high-quality, maintainable code
Deep experience designing and operating distributed systems, including concurrency, failure handling, performance optimization, and data modeling
Proven track record of building automation using REST APIs and Swagger with strong guarantees for idempotency, verification, and safe rollout patterns
Hands-on experience with cloud platforms (AWS/GCP, GKE, Cloud SQL, etc.) and proficiency in containerization and CI/CD workflows using Docker and GitLab

Nice To Haves

Experience with Temporal (or similar platforms) to architect large-scale fleet systems for patching, upgrades, and remediation using deterministic, health-gated workflows and replay-safe designs
Testing discipline for orchestration and state machines, including E2E harnesses, determinism verification, fault injection, and chaos engineering to ensure system reliability
Proficiency in PostgreSQL, including SQL development and schema management, to power high-scale, stateful management-plane services and workflows

Responsibilities

Replace legacy Python/Ansible with a centralized, deterministic orchestration platform, refactoring automation into modular, well-defined workflows while retiring external dependencies and nested logic
Engineer execution patterns with retries, idempotency, rate limits/backpressure, and safe rollbacks/compensation designs aligned to global fleet capacity
Implement safe rollouts using segmentation, canaries, and automated health checks to limit blast radius during fleet-wide upgrades and remediation
Add strong observability and auditability (metrics, traces, replayable histories), participate in the on-call rotation, and drive software-based fixes to reduce toil following post-incident reviews
Integrate AI/LLM capabilities to accelerate legacy code migration and enhance safe operational outcomes through intelligent triage, correlation, and automated runbook generation