Site Reliability Engineering (SRE) Tech Lead

Obsidian Security, Palo Alto, CA
$250,000 - $280,000

About The Position

As the SRE Tech Lead at Obsidian, you will define and build the reliability foundation for a complex, multi-tenant SaaS platform serving enterprise and financial customers. You will operate as a peer to the DevOps and Platform Engineering leads, driving a unified reliability strategy across the organization. Your core mandate: ensure Obsidian detects every system failure before customers do—and communicates proactively when issues arise. This is a hands-on technical leadership role with high ownership and visibility, reporting directly to the CTO. You will architect and implement systems that handle real-world complexity: upstream SaaS dependencies, sparse and noisy data, and mission-critical enterprise workloads.

Requirements

  • 7+ years in SRE, production engineering, or similar roles
  • 2+ years operating as a technical lead
  • Deep expertise with: AWS and/or GCP; Kubernetes and Helm; an observability stack (Prometheus, Grafana); CI/CD systems (GitLab CI/CD, ArgoCD)
  • Proven experience building monitoring for multi-tenant SaaS systems with complex data pipelines
  • Strong debugging skills across distributed microservices and legacy systems
  • Hands-on engineering mindset — able to instrument services directly, not just configure tooling
  • Track record of building or significantly improving incident detection and response systems

Nice To Haves

  • Experience in B2B SaaS serving enterprise or financial customers
  • Familiarity with third-party SaaS connector ingestion patterns
  • Experience building anomaly detection systems or baseline-aware alerting
  • Experience implementing customer-facing status pages and incident communication frameworks

Responsibilities

  • Map and instrument critical system paths for top-tier enterprise customers
  • Build connector health models to classify issues as internal defects (“our bug”), upstream SaaS outages, or expected sparse/low-signal scenarios
  • Establish tiered incident communication: a public status page for all customers and direct outreach for high-priority accounts
  • Define and begin rollout of SLI/SLO standards across microservices
  • Develop self-service instrumentation tooling enabling engineering teams to own observability
  • Implement baseline-aware anomaly detection across all connectors (beyond static thresholds)
  • Mature incident response processes, including structured post-mortems and continuous reliability improvements

Benefits

  • Competitive compensation with equity and 401k
  • Comprehensive healthcare with dental and vision coverage
  • Flexible paid time off and paid holidays
  • 12 weeks of new parent or family leave
  • Personal and professional development resources