Site Reliability Engineering (SRE) Tech Lead

Obsidian Security • Palo Alto, CA

1d • $250,000 - $280,000

About The Position

As the SRE Tech Lead at Obsidian, you will define and build the reliability foundation for a complex, multi-tenant SaaS platform serving enterprise and financial customers. You will operate as a peer to the DevOps and Platform Engineering leads, driving a unified reliability strategy across the organization. Your core mandate: ensure Obsidian detects every system failure before customers do—and communicates proactively when issues arise. This is a hands-on technical leadership role with high ownership and visibility, reporting directly to the CTO. You will architect and implement systems that handle real-world complexity: upstream SaaS dependencies, sparse and noisy data, and mission-critical enterprise workloads.

Requirements

7+ years in SRE, production engineering, or similar roles
2+ years operating as a technical lead
Deep expertise with: AWS and/or GCP; Kubernetes and Helm; observability stack (Prometheus, Grafana); CI/CD systems (GitLab CI/CD, ArgoCD)
Proven experience building monitoring for multi-tenant SaaS systems with complex data pipelines
Strong debugging skills across distributed microservices and legacy systems
Hands-on engineering mindset — able to instrument services directly, not just configure tooling
Track record of building or significantly improving incident detection and response systems

Nice To Haves

Experience in B2B SaaS serving enterprise or financial customers
Familiarity with third-party SaaS connector ingestion patterns
Experience building anomaly detection systems or baseline-aware alerting
Experience implementing customer-facing status pages and incident communication frameworks

Responsibilities

Map and instrument critical system paths for top-tier enterprise customers
Build connector health models to classify issues: internal defects (“our bug”); upstream SaaS outages; expected sparse/low-signal scenarios
Establish tiered incident communication: public status page for all customers; direct outreach for high-priority accounts
Define and begin rollout of SLI/SLO standards across microservices
Develop self-service instrumentation tooling enabling engineering teams to own observability
Implement baseline-aware anomaly detection across all connectors (beyond static thresholds)
Mature incident response processes, including structured post-mortems and continuous reliability improvements