Principal Site Reliability Engineer (CDSS - Advanced URL Filtering)

Palo Alto Networks•Santa Clara, CA

39d•Onsite

About The Position

Palo Alto Networks operates a vast hybrid infrastructure and is among the largest GCP customers. As a Principal Site Reliability Engineer on the CDSS Advanced URL Filtering team, you will play a key role in shaping the reliability and scalability of our systems. This position offers the opportunity to work on cutting-edge technologies, tackle complex challenges, and contribute to the success of innovative solutions that protect our customers.

Requirements

Creative thinker and collaborative team player with strong communication skills and a drive to make a meaningful impact.
Cloud and Infrastructure: Expertise in provisioning and managing cloud infrastructure on public or private cloud platforms (GCP, AWS, or Azure preferred), with strong proficiency in tools like Kubernetes, Terraform, and Ansible.
Database Operations: Proficiency in managing and optimizing SQL and NoSQL databases, including operational tasks such as provisioning, scaling, monitoring, backups, and troubleshooting. Experience with platforms like MongoDB, Redis, PostgreSQL, and MySQL is preferred.
System Reliability: Deep understanding of distributed systems, high-availability architecture, and strategies for scaling and optimizing system performance.
Service-Level Management: Proven experience defining and managing SLAs, SLOs, and SLIs to ensure service reliability and business alignment.
Cost Optimization: Expertise in monitoring and optimizing cloud infrastructure costs, including resource allocation and implementing efficient practices.
Load Balancing and Networking: Hands-on experience with Envoy or similar load balancing technologies, along with strong Linux system administration and advanced network troubleshooting skills.
Automation and Development: Advanced skills in programming and automation using Python, Golang, or shell scripting to streamline operations and enhance system reliability.
Production Deployment and Best Practices: Proven experience managing production deployments, ensuring system stability, and enforcing DevOps best practices.
Monitoring and CI/CD: Familiarity with CI/CD pipelines (GitLab CI preferred) and expertise in designing robust monitoring and alerting systems.
Collaboration and Communication: Exceptional ability to work with cross-functional teams, communicate effectively, and provide technical leadership.
Mindset and Motivation: Self-disciplined, self-managed, and self-motivated, with a strong sense of ownership, urgency, and drive. Passionate about infrastructure and monitoring as code.
Education and Experience: BS/MS in Computer Science, Computer Engineering, or a related field, with 7+ years of hands-on industry experience in Site Reliability Engineering or a similar role managing and improving complex systems at scale.

Responsibilities

Optimize infrastructure costs by monitoring resource utilization, rightsizing instances, and reducing waste to improve cost-efficiency.
Define and manage service-level objectives (SLOs) and related metrics to ensure service reliability and alignment with business goals.
Design and maintain secure cloud infrastructure that prioritizes reliability, scalability, and efficiency.
Develop expertise in new technologies to enhance infrastructure and operations.
Collaborate with cross-functional teams to ensure applications are production-ready and highly available.
Automate deployments, monitoring, and alerting to streamline operations and improve reliability.
Diagnose and resolve critical issues, driving optimization and continuous improvement.
Participate in on-call rotations to support seamless service operations.
Contribute to design reviews to enhance system performance and scalability.
