Member of Technical Staff, Site Reliability Engineer

Vapi | San Francisco, CA
$200,000 - $270,000 | Remote

About The Position

Vapi is seeking a Site Reliability Engineer (SRE) to drive 99.99% call completion. The role is critical because Vapi runs live phone calls, so any stability issue can mean a dropped call. The SRE will run incident command, own SLOs and error budgets, and build a reliability culture from the ground up. This is a hands-on role: you will ship code (Go or TypeScript) for services that monitor and manage the platform, including auto-remediation, capacity forecasters, and oncall tooling. Key responsibilities also include capacity planning, load testing, and KEDA-based autoscaling for Vapi's wscaler and workerpool-cron-scaler.

Requirements

  • Experience running incident command and enforcing postmortem discipline at scale while serving on an oncall rotation.
  • Experience operating SLOs and error budgets in Chronosphere, Prometheus, Grafana, or Datadog.
  • Experience with capacity planning and load testing for production systems with real users.
  • Fluency in Kubernetes production ops, including pod crash diagnosis, HPA/VPA tuning, PodDisruptionBudgets, and graceful shutdown.
  • Knowledge of backpressure and autoscaling patterns, including KEDA and custom metrics scaling.
  • Ability to read and write code.

Nice To Haves

  • Ability to ship platform services in Go or TypeScript.
  • Experience in a real-time/latency-sensitive product environment where degraded performance means a dropped call.

Responsibilities

  • Drive 99.99% call completion.
  • Run incident command and own the postmortem process.
  • Define and manage SLOs and error budgets.
  • Build and maintain reliability tooling, including auto-remediation, capacity forecasters, and oncall tooling.
  • Perform capacity planning and load testing.
  • Tune autoscaling for wscaler and workerpool-cron-scaler using KEDA.
  • Ship code for platform services in Go or TypeScript.
  • Join the oncall rotation and address stability-gap incidents.
  • Define and implement SLOs for the call-completion path.
  • Set up error budgets and SLO-based alerting.
  • Conduct load tests against provider rate limits and per-org concurrency.
  • Drive measurable improvements in p99 call completion or MTTR.

Benefits

  • Competitive salary
  • Excellent equity ownership
  • Comprehensive health coverage (medical, dental, and vision plans)
  • Flexible time off
  • Catered meals
  • Transportation
  • Gym membership
  • $10k annual L&D budget
  • Quarterly off-sites