Staff Escalation Engineer (DevOps)

Zscaler•San Jose, CA

Hybrid

About The Position

About Zscaler

Zscaler accelerates digital transformation so our customers can be more agile, efficient, resilient, and secure. Our cloud native Zero Trust Exchange platform protects thousands of customers from cyberattacks and data loss by securely connecting users, devices, and applications in any location.

Here, impact in your role matters more than title and trust is built on results. We believe in transparency and value constructive, honest debate; we’re focused on getting to the best ideas, faster. We build high-performing teams that can make an impact quickly and with high quality. To do this, we are building a culture of execution centered on customer obsession, collaboration, ownership and accountability. We value high impact and high accountability with a sense of urgency, where you’re enabled to do your best work and embrace your potential. If you’re driven by purpose, thrive on solving complex challenges and want to make a positive difference on a global scale, we invite you to bring your talents to Zscaler and help shape the future of cybersecurity.

Our Engineering team built the world’s largest cloud security platform from the ground up, and we keep building. With more than 100 patents and big plans for enhancing services and increasing our global footprint, the team has made us and our multitenant architecture today's cloud security leader, with more than 65 million users in 185 countries. Bring your vision and passion to our team of cloud architects, software engineers, security experts, and more who are enabling organizations worldwide to harness speed and agility with a cloud-first strategy.

As a Staff Escalation DevOps Engineer within the Shared Platform Services team, you will be instrumental in taking the Zscaler Client Connector Cloud service to the next level in terms of Reliability, Availability and Scalability. This is a hybrid role, with 3 days a week in the San Jose, CA office.

Reporting to the Sr. Manager, Software Engineering QA, you'll be responsible for:

Owning and resolving escalated cloud incidents end-to-end, including impact analysis, debugging, implementing solutions, and communicating with stakeholders
Collaborating with development, security, and operations to design and implement code/configuration fixes for complex system issues
Monitoring system health, performance, and security via PagerDuty, and enhancing alerting to meet SLOs
Building diagnostic tools, dashboards, and documentation to enable faster, more effective incident resolution across the team
Leading production service ownership and supportability by deploying critical fixes, making key deployment decisions, responding to high-pressure off-hours events, and contributing positively to team culture

Requirements

Expert troubleshooting, debugging, and root-cause analysis for complex, high-priority incidents, with 5+ years of live production triage using CPU/memory profilers to diagnose resource exhaustion
Strong hands-on skills in Python, Bash, and Java; cloud platforms (GCP, AWS, Azure); and IaC/configuration tools (Terraform, Ansible), with experience deploying production releases to GCP and data centers via CI/CD
Ability to write complex MySQL queries and generate business reports
Experience with authentication protocols such as SAML and OAuth
Solid networking fundamentals (TCP/IP, UDP, ICMP) and debugging with Postman and packet captures; proficiency with monitoring/alerting tools (Grafana, Kloudfuse) and building dashboards for key service metrics

Nice To Haves

Experience developing applications with Java, REST APIs, MySQL, and jQuery
Bachelor’s degree in Computer Science or Computer Engineering (or related field); MS in CS/CE preferred

Benefits

Various health plans
Time off plans for vacation and sick time
Parental leave options
Retirement options
Education reimbursement
In-office perks
