Staff Site Reliability Engineer - Observability GCP

Okta•Washington, DC

4d•$194,000 - $267,000•Hybrid

About The Position

We are seeking a highly technical Observability Site Reliability Engineer with a specialty in Google Cloud to own and expand our Observability ecosystem into GCP. In this role, you will move beyond simple monitoring to delivering a world-class, comprehensive, scalable Observability Platform that enables our SRE teams and business partners. You will treat infrastructure as code, using Terraform and strong coding proficiency in Go, Python, or Ruby to automate the deployment of agents and collectors across complex distributed systems.

Requirements

Minimum of 5 years of experience scaling and managing observability on Google Cloud Platform.
Expertise in creating intuitive, actionable Splunk or Grafana dashboards that correlate data across multiple sources.
Minimum of 3 years of experience in an SRE, DevOps, or Systems Engineering role with a focus on high-availability systems.
Strong coding skills in Python or Go for building internal tools and automating workflows.
Deep understanding of Linux internals, networking (TCP/IP, DNS, Load Balancing), and container orchestration (Kubernetes/GKE).
A data-driven approach to debugging complex, cross-service performance bottlenecks.
Ability to access federal environments and/or protected federal data.
Must be able to submit documentation establishing U.S. Person status (e.g., a U.S. Citizen, National, Lawful Permanent Resident, Refugee, or Asylee; see 22 CFR 120.15) upon hire.

Nice To Haves

Hands-on experience with OpenTelemetry (OTel), Vector, or similar frameworks for instrumenting applications.
Experience migrating from Splunk to Grafana Loki.
Experience managing native observability tools within AWS.

Responsibilities

Design, build, and maintain scalable observability infrastructure using tools like Terraform.
Optimize the collection, processing, and storage of Observability data to ensure high reliability and low latency of our Splunk and Grafana services.
Participate in on-call rotations and lead post-incident reviews to drive systemic improvements and "observability-driven development."
Eliminate "toil" by automating the deployment and scaling of observability agents and collectors.