Sr. Staff Site Reliability Engineer

Obsidian Security, Palo Alto, CA

About The Position

As a Sr. Staff SRE at Obsidian, you will define and drive the company-wide reliability vision for a complex, multi-tenant SaaS platform serving enterprise and financial customers. You will operate as a strategic partner to DevOps and Platform Engineering leadership, shaping a unified reliability strategy that scales across the organization. Your core mandate: ensure Obsidian detects, diagnoses, and communicates system issues before customers are impacted—consistently and predictably. This is a hands-on technical role that involves architecting and leading the implementation of systems that handle real-world complexity, including upstream SaaS dependencies, sparse and noisy signals, and mission-critical enterprise workloads.

Requirements

  • 5+ years in SRE, Production Engineering, or related roles
  • 3+ years operating at a senior or technical leadership level (Staff or equivalent scope)
  • Deep expertise in AWS and/or GCP, Kubernetes and Helm, observability stacks (Prometheus, Grafana, or equivalent), and CI/CD systems (GitLab CI/CD, ArgoCD, etc.)
  • Proven experience designing and scaling reliability systems for multi-tenant SaaS platforms
  • Strong debugging and systems thinking across distributed microservices and legacy systems
  • Demonstrated ability to lead initiatives that improve incident detection, response, and system resilience
  • Hands-on engineering approach with a track record of building—not just configuring—reliability systems

Nice To Haves

  • Experience in B2B SaaS serving enterprise or financial customers
  • Familiarity with third-party SaaS connector architectures and ingestion patterns
  • Experience building anomaly detection or intelligent alerting systems
  • Experience designing customer-facing status pages and incident communication frameworks

Responsibilities

  • Define and lead long-term reliability strategy across services.
  • Establish end-to-end system visibility frameworks and guide architecture for observability, detection, and resilience.
  • Partner across teams to embed reliability, standardize SLIs/SLOs, and serve as a technical escalation expert.
  • Build intelligent detection systems (anomaly detection, connector health models) and enable self-service observability.
  • Define and evolve a tiered incident communication strategy, improve response practices, and lead postmortems to strengthen reliability and customer trust.
  • Contribute hands-on to system design, monitoring, and debugging across distributed systems and data pipelines.

Benefits

  • Competitive compensation with equity
  • 401k
  • Comprehensive healthcare with dental and vision coverage
  • Flexible paid time off
  • Paid holiday time off
  • 12 weeks of new parent or family leave
  • Personal and professional development resources