Senior Site Reliability Engineer (Upmarket)

Heidi•San Francisco, CA

63d•Onsite

About The Position

Heidi is building an AI Care Partner that works alongside clinicians to make healthcare continuous and deeply human. The company is a team of doctors, engineers, designers, researchers, and creatives building tools that help clinicians stay focused on their patients. In just 18 months, Heidi has given back more than 18 million hours to healthcare professionals, supporting 73 million patient visits in 116 countries, with more than two million patient visits each week powered by Heidi worldwide. Backed by nearly $100 million in funding, Heidi is growing in the US, UK, Canada, and Europe, partnering with leading health systems including the NHS, Beth Israel Lahey Health, and Monash Health.

Requirements

3–6+ years in SRE, DevOps, Platform, or operations-heavy engineering roles.
Experience supporting production systems and participating in on-call rotations.
Comfortable debugging live systems under pressure.
Experience operating cloud infrastructure (AWS preferred).
Working knowledge of Kubernetes and containerised workloads.
Infrastructure as Code experience (Terraform or similar).
Familiarity with monitoring and alerting tools (Datadog, Prometheus, etc.).
Scripting or automation experience (Python, Bash, or similar).

Nice To Haves

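AWS as the preferred cloud platform
Terraform or a similar Infrastructure as Code tool
Datadog, Prometheus, or similar monitoring and alerting tools
Python, Bash, or a similar scripting language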

Responsibilities

Participate in on-call and incident response: Respond to production incidents, contribute to service restoration, and support clear communication during incidents. Over time, take increasing responsibility for leading incidents end-to-end.
Improve operational reliability: Identify recurring issues and reliability risks, and drive fixes through better alerting, automation, system changes, or process improvements.
Own parts of the production environment: Operate and improve Kubernetes clusters, cloud infrastructure, and core platform services, with growing ownership as familiarity increases.
Strengthen observability: Improve dashboards, alerts, logs, and traces so issues are detected earlier and diagnosed faster, with a strong focus on actionable signals.
Reduce operational toil: Automate repetitive tasks, simplify runbooks, and improve tooling to make on-call and day-to-day operations easier and safer.
Support safe change: Improve deployments, rollback mechanisms, and operational readiness to reduce the risk of incidents caused by change.
Contribute to operational practices: Write and maintain runbooks, participate in blameless post-mortems, and help improve incident response processes over time.
Collaborate closely with engineers: Work with product and feature teams to improve production readiness, service ownership, and reliability expectations.

Benefits

Healthcare, dental, and vision benefit options
401(k) with 3% match
Personal development budget of $500 per annum
Become an owner with shares (equity) in the company: if Heidi wins, we all win
The rare chance to create a global impact as you immerse yourself in one of the leading healthtech startups
The opportunity to fast-track your startup career
