Senior/Staff Site Reliability Engineer

Mochi Health•San Francisco, CA

$250,000 - $300,000•Onsite

About The Position

We’re looking for a Senior/Staff Site Reliability Engineer to build Mochi’s AI-driven APM and incident management system: one that doesn’t just alert and page, but learns. This is a foundational role at the intersection of SRE, platform engineering, and applied AI: you’ll design the feedback loops (human-in-the-loop / RLHF-style), guardrails, and automation that let our reliability posture improve over time. You’ll own the systems and workflows that turn incidents into intelligence: automated triage, root cause analysis, remediation, and bug-fix proposals (PRs, test runs, staged rollouts) when issues are code-level. If you’re excited by the idea of building a self-improving SRE “copilot”, this job is for you.

Requirements

7+ years in SRE / platform / infrastructure engineering, with a track record of owning production reliability at scale.
Deep experience operating Kubernetes-based systems in the cloud (AWS preferred), including networking, autoscaling, rollout strategies, and incident mitigation.
Strong software engineering ability—you can debug production issues across services, understand failure modes, and contribute code when needed (Python/Go/TypeScript are all great).
Expert-level grasp of observability and incident response: metrics, logs, tracing, alerting design, and postmortem-driven improvements.
Comfortable building automation that touches production—and obsessive about safety: least-privilege access, audit logs, approvals, canaries, and rollback.
Excited by AI tooling and agentic workflows (or already experienced): LLM-based triage/summarization, retrieval over runbooks/postmortems, evaluation harnesses, and feedback loops.
Strong communication and collaboration skills—you can lead during incidents, write clearly, and align teams around reliability priorities.
Startup mindset: you move fast, take end-to-end ownership, and love turning ambiguity into shipped systems.
Excited to work in-person with our team in San Francisco.

Nice To Haves

Experience building LLM-powered internal tools (incident copilots, automated debugging, RAG over docs/runbooks) and/or RLHF-style feedback pipelines.
Familiarity with security and compliance in regulated environments (HIPAA, SOC 2, audit requirements, PHI handling).
Experience with chaos engineering / game days and resilience testing programs.
Experience building CI/CD guardrails and progressive delivery systems (canaries, automated verification, safe rollout policies).
Prior work on distributed tracing standards (OpenTelemetry), service meshes, or large-scale event-driven systems.

Responsibilities

Build an AI-driven SRE platform that ingests telemetry (logs/metrics/traces), deploy events, and incident artifacts to detect anomalies, summarize failures, and propose mitigations.
Design a human-in-the-loop learning loop (RLHF-style) so the system gets better with every incident: capturing decisions, outcomes, and postmortems into training/evaluation data.
Create safe auto-remediation capabilities: runbook execution, automated rollbacks, and feature-flag actions, all with strong guardrails, auditability, and progressive rollout controls.
Build tooling that can propose bug fixes: generate well-scoped PRs, run tests, and support canary releases, with clear handoff and approval flows.
Define and operationalize SLOs/SLIs and error budgets for critical user journeys (patient onboarding, provider workflows, pharmacy fulfillment, billing, etc.).
Level up observability end-to-end: alert quality, dashboarding, tracing standards, and “unknown unknown” detection.
Lead incident response excellence: on-call improvements, incident command, blameless postmortems, and driving systemic fixes that reduce repeat failures.
Partner with product + engineering teams to reduce toil and improve reliability via better architecture, load testing, resilience testing, and capacity planning.
Establish reliability standards and patterns across the org (golden signals, deployment safety, dependency management, fault isolation).

Benefits

Daily Meals and Espresso Bar - Breakfast, lunch, and dinner every weekday. Our on-site barista keeps the espresso and matcha flowing all day
Pre-Tax Commuter Perks - Save on transit and parking through pre-tax commuter benefits
Top-of-Market Compensation - We offer competitive salaries along with generous equity packages so you can share in the success you help create
Profitable and Rapid Growth - We’re scaling fast, with financial discipline and long-term vision. No VC constraints, just sustainable momentum and smart decisions
High-Impact Work - Help shape the future of digital healthcare. Your work here directly improves lives and scales nationwide
World-Class Team - Collaborate with teammates from Tesla, SpaceX, Citadel, Harvard, IIT, and more. We value excellence, humility, and empathy in equal measure
Comprehensive Benefits - 401(k) with match, generous time off, life insurance, and high-quality medical, dental, and vision plans
Mochi Health Membership - We cover your monthly subscription fee so you can experience the same care as our patients (medications not included)
Time to Recharge - Enjoy unlimited PTO, generous company holidays, and true flexibility. We trust you to take the time you need to rest, reset, and thrive
Wellness First - From weekly mindfulness sessions to group workouts and fitness perks, your physical and mental health are top priority
Team Socials and Community - We make time to connect through regular socials, happy hours, and spontaneous events. Our stocked kitchen doesn’t hurt either
Downtown SF HQ - Our San Francisco office is just steps from BART, Muni, and great food. It’s designed for deep work and casual collaboration