Sr. Site Reliability Engineer

Practice By Numbers
Redmond, WA

About The Position

This is an engineering-first Senior SRE role. We’re looking for senior engineers who have:

  • Built and shipped significant backend systems and/or distributed platforms
  • Owned services end-to-end in production (design → launch → on-call → reliability improvements)
  • Led incident response and driven durable follow-ups
  • Improved reliability by writing software and changing system design, not by adding manual process

You’ll partner closely with product engineering to ensure reliability is designed in from day one, while also building the tooling and platforms that make operating services safer and easier for every engineer. Engineers here own services end-to-end, from design to production reliability.

Important: This is not a system administrator role. We are explicitly hiring an engineering leader in reliability. An engineering degree is an absolute requirement (BS/MS in CS/CE/EE or a closely related engineering field).

Requirements

  • Engineering degree is mandatory: BS/MS in Computer Science, Computer Engineering, Electrical Engineering, or a closely related engineering field.
  • 6+ years of experience in software engineering, SRE, infrastructure/platform engineering, or a related field.
  • Strong programming skills in Go, Python, Java, or similar (production-quality code).
  • Proven experience building and operating production backend services or distributed systems.
  • Meaningful experience in on-call rotations, incident leadership, and post-incident improvement execution.
  • Strong debugging ability across complex systems: latency, saturation, cascading failures, dependency issues.
  • Experience with cloud infrastructure (AWS preferred, GCP/Azure acceptable).

Nice To Haves

  • You’ve owned reliability for customer-facing services with clear, measurable improvements (e.g., higher availability, lower MTTR).
  • You’ve built internal platforms/tooling that made other engineers faster and reduced operational burden.
  • You’ve worked in an SRE culture with SLOs, error budgets, and blameless postmortems.
  • You’ve led multi-quarter reliability initiatives spanning multiple teams/services.

Responsibilities

  • Own reliability outcomes for critical services: availability, latency, incident rate, and recovery time.
  • Design and build reliable, scalable distributed systems that support mission-critical healthcare workflows.
  • Define and operationalize SLOs/SLIs and error budgets; drive adoption across teams and use them to prioritize work.
  • Lead incident response for high-severity issues; improve on-call effectiveness and reduce alert fatigue.
  • Run blameless postmortems and ensure follow-ups are implemented, measured, and made to stick.
  • Write software to eliminate operational toil: automation, self-service tooling, guardrails, and developer platforms.
  • Raise the bar on observability (metrics/logs/traces), alerting strategy, and operational readiness.
  • Improve resilience through capacity planning, load testing, performance tuning, and failure testing.
  • Mentor engineers (SRE and product engineers) on reliability practices, debugging, and production ownership.
  • Drive cross-team improvements like production readiness reviews, release safety (progressive delivery), and standard runbooks.

Benefits

  • Build and operate mission-critical healthcare infrastructure that supports real patient workflows.
  • High impact: reliability work directly improves customer trust and revenue-critical operations.
  • Small team with high ownership, autonomy, and ability to influence architecture.
  • Strong engineering culture focused on automation, simplicity, and measurable outcomes.