Site Reliability Engineering Manager

RapidSOS
Boston, NY
Remote

About The Position

This is an engineering leadership role, not simply an on-call manager. The SRE Manager owns two things: keeping RapidSOS's cloud infrastructure running reliably, and helping product teams get to a place where they can run their own services without routing every operational issue through SRE. RapidSOS powers real-time emergency response by connecting life-critical data to first responders, so reliability here directly impacts outcomes in moments that matter. You'll lead the SRE Operations team and report to the Director of SRE & Platform Engineering. The team has real roots in NOC-style operations, and the honest goal of this role is to move it toward something more engineering-focused and proactive: better tooling, better practices, more ownership at the service team level. That's a gradual transition, and you'll be the one shaping how it happens.

Requirements

  • 7+ years in SRE, platform engineering, or DevOps, with at least two years where you were responsible for a team and not just your own work
  • You’ve been directly responsible for Kubernetes and AWS infrastructure in production environments where uptime and resilience are critical
  • Experience moving a team from reactive ops toward engineering-first reliability practices
  • You’ve worked collaboratively with engineering teams to proactively improve reliability, scalability, and operational readiness before issues reach production
  • Ability to write Python and to review production-quality scripts and tooling
  • You’ve applied SLOs, error budgets, and blameless postmortems in practice to improve reliability and drive better engineering decisions
  • Hands-on familiarity with: Terraform/Atlantis, Kubernetes/Helm/ArgoCD, Datadog, Concourse CI/GitHub Actions, RabbitMQ, and AWS (EKS, RDS/Aurora, ElastiCache, VPC networking, IAM, KMS, Route53)

Responsibilities

  • Own the reliability, scalability, and operational health of RapidSOS Kubernetes clusters, shared services, and core AWS infrastructure; manage upgrades, capacity planning, and node scaling, and test that multi-region failover actually works
  • Drive the IaC foundation in Terraform/Atlantis and champion infrastructure-as-code as a core engineering standard
  • Partner with Engineering Managers to set SLOs for their services, establish error budgets, and help teams build the habits to operate what they ship; the goal is for product teams to own their services, not to have SRE own everything on their behalf
  • Maintain proactive reliability work: capacity planning, failure mode analysis, runbook quality, and chaos engineering exercises; run reliability reviews before major launches and organize failure mode exercises with product teams
  • Drive blameless postmortem practice, ensuring every significant incident produces systemic improvements with clear ownership and closure
  • Run the Tier 1 on-call rotation: scheduling for primary and secondary engineers, coordination with the 3rd-party NOC, and keeping incident escalation processes smooth and manageable
  • Lead incident command on Sev-1s, escalate when needed, and keep engineering leadership informed throughout
  • Lead and grow a high-impact team by mentoring engineers, owning headcount, and thinking ahead about what the team needs as the function grows
  • Shape the team’s long-term AI strategy for infrastructure and operations by identifying opportunities for AI-driven automation and insight generation, evaluating tooling and workflows, and operationalizing best practices for scalable team-wide usage
  • Own reserved instance strategy and the team's AWS cost footprint, track error budgets and SLOs across production services, and communicate that picture clearly to engineering and product leadership
  • Work alongside Platform SRE on bigger infrastructure projects: Gateway API adoption, cross-region architecture, security changes

Benefits

  • Competitive salary, benefits, and equity participation
  • A dynamic, flexible and fun start-up work environment with a highly talented team