Sr. Staff Site Reliability Engineer

Obsidian Security • Palo Alto, CA

About The Position

As a Sr. Staff SRE at Obsidian, you will define and drive the company-wide reliability vision for a complex, multi-tenant SaaS platform serving enterprise and financial customers. You will operate as a strategic partner to DevOps and Platform Engineering leadership, shaping a unified reliability strategy that scales across the organization. Your core mandate: ensure Obsidian detects, diagnoses, and communicates system issues before customers are impacted—consistently and predictably. This is a hands-on technical role that involves architecting and leading the implementation of systems that handle real-world complexity, including upstream SaaS dependencies, sparse and noisy signals, and mission-critical enterprise workloads.

Requirements

5+ years in SRE, Production Engineering, or related roles
3+ years operating at a senior or technical leadership level (Staff or equivalent scope)
Deep expertise in AWS and/or GCP; Kubernetes and Helm; observability stacks (Prometheus, Grafana, or equivalent); and CI/CD systems (GitLab CI/CD, Argo CD, etc.)
Proven experience designing and scaling reliability systems for multi-tenant SaaS platforms
Strong debugging and systems thinking across distributed microservices and legacy systems
Demonstrated ability to lead initiatives that improve incident detection, response, and system resilience
Hands-on engineering approach with a track record of building—not just configuring—reliability systems

Nice To Haves

Experience in B2B SaaS serving enterprise or financial customers
Familiarity with third-party SaaS connector architectures and ingestion patterns
Experience building anomaly detection or intelligent alerting systems
Experience designing customer-facing status pages and incident communication frameworks

Responsibilities

Define and lead long-term reliability strategy across services.
Establish end-to-end system visibility frameworks and guide architecture for observability, detection, and resilience.
Partner across teams to embed reliability, standardize SLIs and SLOs, and serve as a technical escalation expert.
Build intelligent detection systems (anomaly detection, connector health models) and enable self-service observability.
Define and evolve a tiered incident communication strategy, improve response practices, and lead postmortems to strengthen reliability and customer trust.
Contribute hands-on to system design, monitoring, and debugging across distributed systems and data pipelines.