Principal Site Reliability Engineer

Palo Alto Networks

65d•Onsite

About The Position

At Palo Alto Networks®, the mission is to protect the digital way of life by solving real-world problems with cutting-edge technology and bold thinking. The company aims to be the cybersecurity partner of choice, trailblazing the path and shaping the future of the industry, guided by the values of Disruption, Collaboration, Execution, Integrity, and Inclusion. AI is integrated into its operations to augment individual impact. Collaboration is highly valued: most teams work from the office full time, with flexibility when needed, to support real-time problem-solving, stronger relationships, and precision in outcomes.

Palo Alto Networks operates a large hybrid infrastructure and is a significant GCP customer. As a Site Reliability Engineer, you will join a team responsible for supporting services on this infrastructure, encompassing automation, architecture, performance, metrics, troubleshooting, security, and reliability. The technology stack includes Kubernetes, Docker, GCP, AWS, Ansible, Terraform, Vault, GitLab, Spinnaker, Pub/Sub, Bigtable, Memorystore, BigQuery, RabbitMQ, Kafka, MySQL, Python, and Go, with an expectation that you will learn additional technologies as needed.

The WildFire team manages the industry's largest cloud-based malware protection engine, which uses machine learning and crowdsourced intelligence to prevent unknown malware variants. The infrastructure team ensures the scalability and high availability of the WildFire clouds.

Requirements

BS or MS in Computer Science or a related field, or equivalent professional or military experience
Expertise in configuration management with frameworks such as Ansible, Terraform, Helm, and Kubernetes
Proficient in Python and/or Go
Expertise in managing applications in Kubernetes clusters with autoscaling enabled
Experience in Production Engineering, DevOps, or Site Reliability
Expertise in public cloud platforms (GCP or AWS), especially GCP
Strong Linux administration, internals, and network troubleshooting
Proficiency in automating tasks with Python, Go, and shell scripting
Ability to diagnose and troubleshoot complex distributed systems handling high-volume transactions
Excellent written and verbal communication, able to collaborate and rally support
Self-disciplined, self-managed, and self-motivated, with a strong sense of ownership, urgency, and drive
Passion for infrastructure and monitoring as code
Able to understand and dissect new technology stacks quickly