Staff Escalation Engineer (DevOps)

Zscaler, San Jose, CA
8d, Hybrid

About The Position

About Zscaler

Zscaler accelerates digital transformation so our customers can be more agile, efficient, resilient, and secure. Our cloud native Zero Trust Exchange platform protects thousands of customers from cyberattacks and data loss by securely connecting users, devices, and applications in any location.

Here, impact in your role matters more than title and trust is built on results. We believe in transparency and value constructive, honest debate; we’re focused on getting to the best ideas, faster. We build high-performing teams that can make an impact quickly and with high quality. To do this, we are building a culture of execution centered on customer obsession, collaboration, ownership and accountability. We value high impact and high accountability with a sense of urgency, where you’re enabled to do your best work and embrace your potential. If you’re driven by purpose, thrive on solving complex challenges and want to make a positive difference on a global scale, we invite you to bring your talents to Zscaler and help shape the future of cybersecurity.

Our Engineering team built the world’s largest cloud security platform from the ground up, and we keep building. With more than 100 patents and big plans for enhancing services and increasing our global footprint, the team has made us and our multitenant architecture today's cloud security leader, with more than 65 million users in 185 countries. Bring your vision and passion to our team of cloud architects, software engineers, security experts, and more who are enabling organizations worldwide to harness speed and agility with a cloud-first strategy.

As a Staff Escalation DevOps Engineer within the Shared Platform Services team, you will be instrumental in taking the Zscaler Client Connector Cloud service to the next level in terms of Reliability, Availability and Scalability. This is a hybrid role, reporting to the San Jose, CA office 3 days a week. Reporting to the Sr. Manager, Software Engineering QA, you'll be responsible for the duties listed under Responsibilities below.

Requirements

  • Expert troubleshooting, debugging, and root-cause analysis for complex, high-priority incidents, with 5+ years of live production triage using CPU/memory profilers to diagnose resource exhaustion
  • Strong hands-on skills in Python, Bash, and Java; cloud platforms (GCP, AWS, Azure); and IaC/configuration tools (Terraform, Ansible), with experience deploying production releases to GCP and data centers via CI/CD
  • Ability to write complex MySQL queries and generate business reports
  • Experience with authentication protocols such as SAML and OAuth
  • Solid networking fundamentals (TCP/IP, UDP, ICMP) and debugging with Postman and packet captures; proficient with monitoring/alerting tools (Grafana, Kloudfuse) and building dashboards for key service metrics

Nice To Haves

  • Experience developing applications with Java, REST APIs, MySQL, and jQuery
  • Bachelor’s degree in Computer Science or Computer Engineering (or related field); MS in CS/CE preferred

Responsibilities

  • Owning and resolving escalated cloud incidents end-to-end, including impact analysis, debugging, implementing solutions, and communicating with stakeholders
  • Collaborating with development, security, and operations to design and implement code/configuration fixes for complex system issues
  • Monitoring system health, performance, and security via PagerDuty, and enhancing alerting to meet SLOs
  • Building diagnostic tools, dashboards, and documentation to enable faster, more effective incident resolution across the team
  • Leading production service ownership and supportability by deploying critical fixes, making key deployment decisions, responding to high-pressure off-hours events, and contributing positively to team culture

Benefits

  • Various health plans
  • Time off plans for vacation and sick time
  • Parental leave options
  • Retirement options
  • Education reimbursement
  • In-office perks