Senior/Staff Site Reliability Engineer

Mochi Health | San Francisco, CA
$250,000 - $300,000 | Onsite

About The Position

We’re looking for a Senior/Staff Site Reliability Engineer to build Mochi’s AI-driven APM and incident management system: one that doesn’t just alert and page, but learns. This is a foundational role at the intersection of SRE, platform engineering, and applied AI: you’ll design the feedback loops (human-in-the-loop / RLHF-style), guardrails, and automation that let our reliability posture improve over time. You’ll own the systems and workflows that turn incidents into intelligence: automated triage, root cause analysis, remediation, and bug-fix proposals (PRs, test runs, staged rollouts) when issues are code-level. If you’re excited by the idea of building a self-improving SRE “copilot”, this job is for you.

Requirements

  • 7+ years in SRE / platform / infrastructure engineering, with a track record of owning production reliability at scale.
  • Deep experience operating Kubernetes-based systems in the cloud (AWS preferred), including networking, autoscaling, rollout strategies, and incident mitigation.
  • Strong software engineering ability—you can debug production issues across services, understand failure modes, and contribute code when needed (Python/Go/TypeScript are all great).
  • Expert-level grasp of observability and incident response: metrics, logs, tracing, alerting design, and postmortem-driven improvements.
  • Comfortable building automation that touches production—and obsessive about safety: least-privilege access, audit logs, approvals, canaries, and rollback.
  • Excited by AI tooling and agentic workflows (or already experienced): LLM-based triage/summarization, retrieval over runbooks/postmortems, evaluation harnesses, and feedback loops.
  • Strong communication and collaboration skills—you can lead during incidents, write clearly, and align teams around reliability priorities.
  • Startup mindset: you move fast, take end-to-end ownership, and love turning ambiguity into shipped systems.
  • Excited to work in-person with our team in San Francisco.

Nice To Haves

  • Experience building LLM-powered internal tools (incident copilots, automated debugging, RAG over docs/runbooks) and/or RLHF-style feedback pipelines.
  • Familiarity with security and compliance in regulated environments (HIPAA, SOC 2, audit requirements, PHI handling).
  • Experience with chaos engineering / game days and resilience testing programs.
  • Experience building CI/CD guardrails and progressive delivery systems (canaries, automated verification, safe rollout policies).
  • Prior work on distributed tracing standards (OpenTelemetry), service meshes, or large-scale event-driven systems.

Responsibilities

  • Build an AI-driven SRE platform that ingests telemetry (logs/metrics/traces), deploy events, and incident artifacts to detect anomalies, summarize failures, and propose mitigations.
  • Design a human-in-the-loop learning loop (RLHF-style) so the system gets better with every incident: capturing decisions, outcomes, and postmortems into training/evaluation data.
  • Create safe auto-remediation capabilities: runbook execution, automated rollbacks, and feature-flag actions, all with strong guardrails, auditability, and progressive rollout controls.
  • Build tooling that can propose bug fixes: generate well-scoped PRs, run tests, support canary releases—with clear handoff and approval flows.
  • Define and operationalize SLOs/SLIs and error budgets for critical user journeys (patient onboarding, provider workflows, pharmacy fulfillment, billing, etc.).
  • Level up observability end-to-end: alert quality, dashboarding, tracing standards, and “unknown unknown” detection.
  • Lead incident response excellence: on-call improvements, incident command, blameless postmortems, and driving systemic fixes that reduce repeat failures.
  • Partner with product + engineering teams to reduce toil and improve reliability via better architecture, load testing, resilience testing, and capacity planning.
  • Establish reliability standards and patterns across the org (golden signals, deployment safety, dependency management, fault isolation).

Benefits

  • Daily Meals and Espresso Bar - Breakfast, lunch, and dinner every weekday. Our on-site barista keeps the espresso and matcha flowing all day
  • Pre-Tax Commuter Perks - Save on transit and parking through pre-tax commuter benefits
  • Top-of-Market Compensation - We offer competitive salaries along with generous equity packages so you can share in the success you help create
  • Profitable and Rapid Growth - We’re scaling fast, with financial discipline and long-term vision. No VC constraints, just sustainable momentum and smart decisions
  • High-Impact Work - Help shape the future of digital healthcare. Your work here directly improves lives and scales nationwide
  • World-Class Team - Collaborate with teammates from Tesla, SpaceX, Citadel, Harvard, IIT, and more. We value excellence, humility, and empathy in equal measure
  • Comprehensive Benefits - 401(k) with match, generous time off, life insurance, and high-quality medical, dental, and vision plans
  • Mochi Health Membership - We cover your monthly subscription fee so you can experience the same care as our patients (medications not included)
  • Time to Recharge - Enjoy unlimited PTO, generous company holidays, and true flexibility. We trust you to take the time you need to rest, reset, and thrive
  • Wellness First - From weekly mindfulness sessions to group workouts and fitness perks, your physical and mental health are top priority
  • Team Socials and Community - We make time to connect through regular socials, happy hours, and spontaneous events. Our stocked kitchen doesn’t hurt either
  • Downtown SF HQ - Our San Francisco office is just steps from BART, Muni, and great food. It’s designed for deep work and casual collaboration