Senior Manager, Site Reliability Engineering

Clover Health

1d • $187,000 - $243,000 • Remote

About The Position

At Counterpart Health, we are transforming healthcare and improving patient care with our innovative primary care tool, Counterpart Assistant. By supporting Primary Care Physicians (PCPs), we deliver improved outcomes at lower cost through early diagnosis and longitudinal care management of chronic conditions.

We're looking for a Senior Manager of Site Reliability Engineering to join our team. You'll lead a team of ~10 SREs across North America, the UK, Hong Kong, and New Zealand, owning both the day-to-day operations and the long-term technical direction of the SRE organization. This role sits at the intersection of people leadership, technical depth, and strategic partnership: you're here to make Counterpart's infrastructure reliable, scalable, and cost-efficient, and to transform the SRE team's engagement model from reactive support to proactive collaboration with our product engineering pillars.

Requirements

6+ years managing an SRE team and 10+ years of hands-on SRE or infrastructure engineering experience.
Deeply comfortable with our core stack: Kubernetes, GCP (GKE, Cloud SQL, Pub/Sub, GCS), Terraform, Helm, ArgoCD, PostgreSQL, and Prometheus/Grafana.
Strong programming skills in Python and/or Go, and comfort writing and reviewing infrastructure tooling code, including with the help of AI coding tools.
Experience with CI/CD pipelines (GitHub Actions) and a track record of building or improving developer tooling and automation.
Sound build-vs.-buy judgment: you default to the right answer, not the easiest one, and are comfortable building internal tooling when existing solutions don't fit.
Experience leading teams across multiple time zones and a track record of developing engineers into strong technical contributors.

Responsibilities

Lead and grow our SRE team of ~10 engineers, including hiring, retention, career development, and performance management across multiple time zones (US, HK, NZ).
Build strategic partnerships with product engineering pillars — shifting SRE from reactive, ticket-based support to proactive co-ownership of reliability outcomes.
Scale our multi-tenant infrastructure to support new customer onboarding and growing patient populations.
Own cloud cost management and FinOps practices, building frameworks that balance cost control with reliability and performance.
Champion developer self-service and platform engineering, building self-service capabilities so product teams can manage routine operations without filing SRE tickets.
Establish SLOs/SLIs for critical services and improve alert quality so every page is meaningful.
Ensure the SRE team fully leverages AI tooling in its workflows, at the same level as the rest of engineering: using tools like Claude Code for IaC generation, log analysis, root cause investigation, and automation of repetitive work.