Principal Site Reliability Engineer

Palo Alto Networks • Office - USA - CA - Headquarters, CA

3d • Onsite

About The Position

The Cortex team builds and delivers the industry’s most advanced SecOps platform, consisting of XDR, XSIAM, XSOAR, and XPANSE. As a Principal Site Reliability Engineer within the Cortex DevOps team, you will serve as a technical leader responsible for driving the reliability, scalability, observability, and operational excellence strategy across the Cortex platform. You will partner closely with engineering, product, and infrastructure teams to influence architecture decisions, establish reliability standards, and build innovative solutions that improve service availability, performance, and operational efficiency at global scale. This role requires deep expertise in cloud infrastructure, observability, distributed systems, automation, and incident management. You will help shape the future direction of our observability and reliability platforms while mentoring engineers and driving best practices across the organization.

Requirements

10+ years of experience in Site Reliability Engineering, DevOps, Cloud Engineering, or related disciplines.
Deep expertise with Prometheus, Thanos, Grafana, OpenTelemetry, and modern observability platforms.
Strong understanding of SRE principles including SLIs, SLOs, error budgets, incident management, and operational excellence.
Expert knowledge of Google Cloud Platform (GCP), Amazon Web Services (AWS), or similar cloud platforms.
Expert-level experience with Kubernetes, Docker, and cloud-native architectures.
Strong software engineering and automation skills with Python, Terraform, Ansible, and GitOps practices in Linux environments.
Proven ability to influence technical direction and drive cross-functional initiatives across multiple engineering teams.

Nice To Haves

Experience building and operating observability platforms at large scale.
Experience implementing AI-driven operational tooling, automation, or AIOps solutions.
Strong communication and leadership skills with experience mentoring senior engineers and leading complex technical initiatives.
Ability to operate independently, influence stakeholders, and drive outcomes across organizational boundaries.

Responsibilities

Define and drive reliability, observability, and operational excellence standards across Cortex services and infrastructure.
Design and evolve large-scale observability platforms using technologies such as Prometheus, Thanos, Grafana, OpenTelemetry, and cloud-native monitoring solutions.
Partner with engineering teams to ensure services are designed, instrumented, and operated with reliability and scalability in mind.
Drive improvements in monitoring, alerting, incident management, and service health to proactively identify and prevent customer-impacting issues.
Lead initiatives focused on automation, self-healing systems, operational efficiency, and reduction of operational toil.
Influence architectural decisions and technology adoption to improve platform reliability, performance, and cost efficiency.
Mentor engineers and provide technical leadership across multiple teams and organizations.
Stay current with emerging technologies and industry trends, evaluating and implementing solutions that advance Cortex's operational capabilities.
Provide leadership during major incidents and drive post-incident reviews focused on systemic improvements.