Site Reliability Engineer

Heidi
San Francisco, CA
Onsite

About The Position

Healthcare needs a better rhythm: one that keeps care continuous and deeply human. Heidi is building an AI Care Partner that works alongside clinicians to make that possible. We’re a team of doctors, engineers, designers, researchers, and creatives building tools that help clinicians stay focused on what matters most: their patients. In just 18 months, Heidi has given back more than 18 million hours to healthcare professionals — supporting 73 million patient visits in 116 countries. Today, more than two million patient visits each week are powered by Heidi worldwide. Backed by nearly $100 million in funding, we’re growing in the US, UK, Canada, and Europe, partnering with leading health systems including the NHS, Beth Israel Lahey Health, and Monash Health.

Requirements

  • 3–6+ years in SRE, DevOps, Platform, or operations-heavy engineering roles.
  • Experience supporting production systems and participating in on-call rotations.
  • Comfortable debugging live systems under pressure.
  • Experience operating cloud infrastructure (AWS preferred).
  • Working knowledge of Kubernetes and containerised workloads.
  • Infrastructure as Code experience (Terraform or similar).
  • Familiarity with monitoring and alerting tools (Datadog, Prometheus, etc.).
  • Scripting or automation experience (Python, Bash, or similar).

Responsibilities

  • Participate in on-call and incident response: Respond to production incidents, contribute to service restoration, and support clear communication during incidents. Over time, take increasing responsibility for leading incidents end-to-end.
  • Improve operational reliability: Identify recurring issues and reliability risks, and drive fixes through better alerting, automation, system changes, or process improvements.
  • Own parts of the production environment: Operate and improve Kubernetes clusters, cloud infrastructure, and core platform services, with growing ownership as familiarity increases.
  • Strengthen observability: Improve dashboards, alerts, logs, and traces so issues are detected earlier and diagnosed faster, with a strong focus on actionable signals.
  • Reduce operational toil: Automate repetitive tasks, simplify runbooks, and improve tooling to make on-call and day-to-day operations easier and safer.
  • Support safe change: Improve deployments, rollback mechanisms, and operational readiness to reduce the risk of incidents caused by change.
  • Contribute to operational practices: Write and maintain runbooks, participate in blameless post-mortems, and help improve incident response processes over time.
  • Collaborate closely with engineers: Work with product and feature teams to improve production readiness, service ownership, and reliability expectations.

Benefits

  • Healthcare, dental, and vision benefit options
  • 401(k) with 3% match
  • Personal development budget of $500 per annum
  • Become an owner, with shares (equity) in the company: if Heidi wins, we all win
  • The rare chance to create a global impact as you immerse yourself in one of the leading healthtech startups
  • The opportunity to fast-track your startup career