Staff Site Reliability Engineer

Zscaler•San Jose, CA

16h•Hybrid

About The Position

Zscaler is a pioneer and global leader in zero trust security. The world’s largest businesses, critical infrastructure organizations, and government agencies rely on Zscaler to secure users, branches, applications, data & devices, and to accelerate digital transformation initiatives. Distributed across more than 160 data centers globally, the Zscaler Zero Trust Exchange platform combined with advanced AI combats billions of cyber threats and policy violations every day and unlocks productivity gains for modern enterprises by reducing costs and complexity.

Here, impact in your role matters more than title and trust is built on results. We believe in transparency and value constructive, honest debate; we’re focused on getting to the best ideas, faster. We build high-performing teams that can make an impact quickly and with high quality. To do this, we are building a culture of execution centered on customer obsession, collaboration, ownership and accountability. We champion an “AI Forward, People First” philosophy to help us accelerate and innovate, empowering our people to embrace their potential. If you’re driven by purpose, thrive on solving complex challenges and want to make a positive difference on a global scale, we invite you to bring your talents to Zscaler to help shape the future of cybersecurity.

We are looking for a Staff Site Reliability Engineer to join our team. This role will report to the Senior Manager, Site Reliability Engineering and offers the flexibility of hybrid (3 days a week) out of San Jose, CA, or can be performed fully remote. As a key member of the Zero Trust Exchange team, you will be responsible for all aspects of the Zscaler production data center services, including servers, operating systems, storage, and supporting systems. You will be an instrumental part of the Cloud Operations team, ensuring the availability, latency, performance, efficiency, and scalability of a cloud that processes tens of billions of transactions daily.

Requirements

US citizenship (required due to the nature of assigned customers)
5+ years of industry experience in a 24/7 NOC or Cloud Operations environment
Proficiency with programming languages such as Python or Bash
Deep understanding of the OSI model and standard networking protocols, including HTTP, DNS, TCP/IP, and ICMP
Hands-on experience with monitoring tools (e.g. Nagios, Grafana, Prometheus) and networking concepts such as firewalls and load balancing
Ability and flexibility to work after hours or weekends for application releases and deployments in a fast-paced environment

Nice To Haves

Experience with programming languages like Go
Experience with incident management and the ability to drive incidents to resolution
Bachelor’s or Master’s degree in computer science or relevant field (or equivalent experience)

Responsibilities

Design, code, and deploy software solutions and automation while looking for opportunities to optimize the existing code-base for maintainability and reusability
Create and deploy scalable monitoring systems and end-to-end solutions for a massively growing global infrastructure in collaboration with Software Engineering and Development teams
Monitor applications and services across environments, participate in the on-call rotation, and implement strategies to prevent issues from recurring
Resolve escalated issues and reduce recurring operational overhead by documenting and automating processes while deploying patches, upgrades, and administrative tools
Collaborate with cross-functional teams to recommend integration strategies for platforms and applications and to identify opportunities for process improvement