About The Position

We are looking for a Software Engineer (Reliability) to join our team in San Jose, CA, reporting to the Vice President of Engineering. This is a hybrid role, with three days a week onsite, within the Service Platform Automation department. You will build and operate the orchestration and reliability automation that manages ZIA’s fleet lifecycle at massive scale. This is a high-ownership role: you will design and implement orchestration workflows and the supporting services needed for safe, deterministic, idempotent fleet operations, while helping the team evolve toward AI-first execution and operations.

Requirements

  • BS/MS in Computer Science or a related technical field with 5+ years of experience building and operating production-grade software systems
  • Strong proficiency in backend/systems languages (Go, Java, C++, or Rust) with the ability to write high-quality, maintainable code
  • Deep experience designing and operating distributed systems, including concurrency, failure handling, performance optimization, and data modeling
  • Proven track record of building automation using REST APIs and Swagger with strong guarantees for idempotency, verification, and safe rollout patterns
  • Hands-on experience with cloud platforms (AWS/GCP, GKE, Cloud SQL, etc.) and proficiency in containerization and CI/CD workflows using Docker and GitLab

Nice To Haves

  • Experience with Temporal (or similar platforms) to architect large-scale fleet systems for patching, upgrades, and remediation using deterministic, health-gated workflows and replay-safe designs
  • Testing discipline for orchestration and state machines, including E2E harnesses, determinism verification, fault injection, and chaos engineering to ensure system reliability
  • Proficiency in PostgreSQL, including SQL development and schema management, to power high-scale, stateful management-plane services and workflows

Responsibilities

  • Replace legacy Python/Ansible with a centralized, deterministic orchestration platform, refactoring automation into modular, well-defined workflows while retiring external dependencies and nested logic
  • Engineer execution patterns with retries, idempotency, rate limits/backpressure, and safe rollbacks/compensation designs aligned to global fleet capacity
  • Implement safe rollouts using segmentation, canaries, and automated health checks to limit blast radius during fleet-wide upgrades and remediation
  • Add strong observability and auditability (metrics, traces, replayable histories), participate in the on-call rotation, and drive software-based fixes to reduce toil following post-incident reviews
  • Integrate AI/LLM capabilities to accelerate legacy code migration and enhance safe operational outcomes through intelligent triage, correlation, and automated runbook generation

Benefits

  • Various health plans
  • Time off plans for vacation and sick time
  • Parental leave options
  • Retirement options
  • Education reimbursement
  • In-office perks, and more!