Principal Site Reliability Engineer

Palo Alto Networks

Onsite

About The Position

Palo Alto Networks operates large-scale infrastructure and is one of the largest GCP customers. As a Principal Site Reliability Engineer on the ADEM (Autonomous Digital Experience Management) team, you will support the services that provide end-to-end visibility and self-healing capabilities for our global customers. The work spans automation, architecture, performance, observability, troubleshooting, security, and reliability. Our Infrastructure Platform stack includes Terraform, Kubernetes, GitLab CI, ArgoCD, Prometheus, Grafana, Loki, Docker, GCP, AWS, Vault, Kafka, MySQL, Python, Bash, and Go.

Requirements

7+ years as an engineer in Infrastructure, Operations, DevOps, or System Engineering.
Familiarity with, and demonstrated proficiency in, code-assist and AI productivity tools such as Claude Code, Cursor, Windsurf, or GitHub Copilot to accelerate development and troubleshooting.
Expertise in building high-availability, scalable cloud-native applications on GCP (preferred) or AWS.
Expertise in configuration management and IaC (Terraform, Helm, Ansible).
Strong proficiency in programming languages like Python, Go, or Java; experience with data streaming frameworks like Kafka or Apache Pulsar is a plus.
Deep experience in Kubernetes (GKE/EKS), container networking, and Linux internals.
Experience with GitOps principles and tools like GitLab CI and ArgoCD.
Familiarity with compliance and security frameworks (FedRAMP, SOC 2) and with automating policy as code.
Excellent communication skills, with a "rally support" mindset for collaborating across cross-functional teams.
BS or MS in Computer Science, a related field, or equivalent professional/military experience.

Responsibilities

Drive the success of SRE and DevOps through expert contributions in CI/CD and AIOps initiatives, moving the organization toward self-healing infrastructure.
Architect "Golden Paths" for service delivery, ensuring that SLOs, error budgets, and automated canary analysis are integrated by default.
Design, build, and operate reliable, secure Cloud infrastructure that supports high-scale synthetic monitoring and Real User Monitoring (RUM).
Ensure applications are production-ready, scalable, and resilient, collaborating closely with developers, researchers, and data scientists.
Develop tools and automation frameworks that champion Infrastructure as Code (IaC) and Monitoring as Code (MaC).
Lead root cause analysis (RCA) of critical business and production issues, driving improvements that prevent recurrence.
