Principal Platform Engineer, Observability (CIPE)

Palo Alto Networks•Office - USA - CA - Headquarters, CA

2d•$147,000 - $237,500•Onsite

About The Position

Palo Alto Networks is seeking a Principal Platform Engineer to architect, build, and evolve its observability platform across infrastructure, applications, and developer workflows. This role requires a hands-on technical leader with deep experience in open-source observability technologies and Chronosphere, as well as proficiency in building AI-enabled systems and developer experiences using modern AI coding tools. The engineer will act as a technical architect for the observability stack, collaborating with engineering, platform, SRE, and product teams to establish standards for metrics, logs, traces, profiling, synthetics, alerting, dashboards, and incident response. A key responsibility will be leading the integration of AI agents, copilots, and automation into observability workflows to make telemetry, debugging, and reliability operations accessible to both humans and AI agents. The role demands comfort operating at both strategic and implementation levels, including designing architecture, writing production-grade code, reviewing systems, mentoring engineers, and driving adoption across teams.

Requirements

7+ years of software engineering, platform engineering, infrastructure engineering, or SRE experience, with significant experience building production-grade distributed systems.
Deep hands-on experience with observability systems, including metrics, logs, traces, profiling, dashboards, synthetics, alerting, and incident workflows.
Strong expertise with OpenTelemetry, including SDKs, Collector pipelines, exporters, processors, receivers, semantic conventions, and instrumentation patterns.
Strong experience with Prometheus-compatible metrics, Alertmanager, scraping, cardinality management, federation, and remote write patterns.
Hands-on experience with distributed tracing systems such as Jaeger or similar technologies.
Experience with continuous profiling technologies.
Strong experience with synthetic monitoring and proactive availability testing, including API checks, browser-based checks, blackbox monitoring, dependency checks, and integration with alerting and SLO workflows.
Strong Kubernetes experience, including workload monitoring, service discovery, operators/controllers, Helm, resource management, cluster observability, and multi-tenant platform patterns.
Strong Python engineering skills, including building internal tools, automation, integrations, services, and instrumentation libraries.
Hands-on experience building real solutions, tools, and developer workflows using modern AI coding agents such as Claude, Codex, or equivalent, including prompt design, skill/tool/MCP authoring, agent orchestration, and integrating LLMs into production engineering systems.
Practical understanding of how to design AI-friendly platforms: structured APIs, machine-readable runbooks, telemetry schemas, and skills/tools that allow both humans and AI agents to operate observability effectively.
Experience designing and operating high-scale, highly available infrastructure systems.
Strong understanding of SLOs, SLIs, error budgets, incident response, on-call practices, production readiness, and reliability engineering principles.
Experience writing clear technical design documents, RFCs, standards, operational runbooks, and architecture recommendations.
Ability to influence teams through technical depth, collaboration, mentorship, and pragmatic decision-making.

Nice To Haves

Experience with Go, Java, Rust, or Node.js (preferred programming languages).
Experience with Chronosphere.

Responsibilities

Design and lead the evolution of a modern observability platform using OpenTelemetry, Prometheus, Jaeger, Alertmanager, and related CNCF ecosystem tools.
Define architecture standards for telemetry collection, processing, storage, querying, visualization, alerting, retention, and governance.
Build scalable systems for metrics, distributed tracing, continuous profiling, log aggregation, synthetic monitoring, service health monitoring, and reliability analytics.
Establish best practices for instrumentation across services, infrastructure, Kubernetes workloads, CI/CD systems, and developer platforms.
Evaluate trade-offs around data cardinality, sampling, storage cost, retention, query performance, multi-tenancy, reliability, and operational complexity.
Make pragmatic recommendations on open source, self-managed, managed-service, and hybrid observability approaches.
Create paved-road observability patterns that help engineering teams instrument, monitor, debug, and operate services with minimal friction.
Lead adoption and standardization of OpenTelemetry across applications, services, infrastructure, and platform components.
Design and implement telemetry pipelines using OpenTelemetry Collector, exporters, processors, receivers, connectors, and custom extensions.
Define standards for traces, metrics, logs, spans, attributes, resources, service names, correlation IDs, and the use of semantic conventions.
Build libraries, SDK wrappers, golden paths, and internal tooling to simplify observability instrumentation for engineering teams.
Architect metrics systems using Prometheus-compatible formats, PromQL, remote write, federation, scraping strategies, service discovery, recording rules, and long-term storage backends.
Design alerting frameworks that reduce noise, improve signal quality, and align with SLOs, SLIs, error budgets, and incident response practices.
Create reusable alerting patterns for Kubernetes, infrastructure, applications, APIs, databases, queues, event-driven systems, and distributed services.
Define standards for dashboarding, runbooks, escalation policies, alert ownership, and production readiness.
Partner with SRE and engineering teams to mature monitoring practices and improve service reliability.
Build observability capabilities for Kubernetes environments, including cluster monitoring, workload telemetry, service mesh visibility, ingress and egress monitoring, and node-level insights.
Develop and maintain Helm charts, Kubernetes manifests, operators, sidecars, agents, DaemonSets, and deployment automation for observability components.
Work with platform teams to ensure observability systems are reliable, secure, multi-tenant, highly available, and easy to operate.
Define standards for resource usage, scaling, upgrades, failover, backup, disaster recovery, access control, and tenant isolation for observability infrastructure.
Support observability across multi-cluster, multi-region, and hybrid cloud environments where applicable.
Design and build AI-enabled observability workflows that allow both humans and AI agents to investigate incidents, query telemetry, summarize signals, and propose remediations.
Define and publish reusable AI skills, agents, and tools (e.g., Claude skills, Codex tools, MCP servers, structured prompts) that encode observability best practices and make platform capabilities consumable by engineering teams and autonomous agents.
Build paved-road AI integrations for triage, alert summarization, root-cause analysis, log/trace exploration, runbook generation, dashboard authoring, and post-incident review.
Establish standards for grounding AI agents in authoritative telemetry, runbooks, and service catalogs, with strong guardrails around accuracy, safety, cost, and auditability.
Use AI coding tools (Claude, Codex, and equivalents) as a first-class part of the engineering workflow for code generation, refactoring, instrumentation rollouts, migrations, and platform automation, and define patterns the broader team can adopt.
Partner with platform, SRE, and product teams to evolve observability from human-only dashboards toward agent-assisted, self-serve reliability operations.