Site Reliability Engineering Manager

RapidSOS•Boston, NY

3d•Remote

About The Position

This is an engineering leadership role, not simply an on-call manager. The SRE Manager owns two things: keeping RapidSOS's cloud infrastructure running reliably, and helping product teams get to a place where they can run their own services without routing every operational issue through SRE. RapidSOS powers real-time emergency response by connecting life-critical data to first responders, so reliability here directly impacts outcomes in moments that matter. You'll lead the SRE Operations team and report to the Director of SRE & Platform Engineering. The team has real roots in NOC-style operations, and the honest goal of this role is to move it toward something more engineering-focused and proactive: better tooling, better practices, more ownership at the service team level. That's a gradual transition, and you'll be the one shaping how it happens.

Requirements

7+ years in SRE, platform engineering, or DevOps, with at least two years where you were responsible for a team and not just your own work
You’ve been directly responsible for Kubernetes and AWS infrastructure in production environments where uptime and resilience are critical
Experience moving a team from reactive ops toward engineering-first reliability practices
You’ve worked collaboratively with engineering teams to proactively improve reliability, scalability, and operational readiness before issues reach production
Ability to write Python and to review production-quality scripts and tooling
You’ve applied SLOs, error budgets, and blameless postmortems in practice to improve reliability and drive better engineering decisions
Hands-on familiarity with: Terraform/Atlantis, Kubernetes/Helm/ArgoCD, Datadog, Concourse CI/GitHub Actions, RabbitMQ, and AWS (EKS, RDS/Aurora, ElastiCache, VPC networking, IAM, KMS, Route53)

Responsibilities

Own the reliability, scalability, and operational health of RapidSOS's Kubernetes clusters, shared services, and core AWS infrastructure; this includes upgrades, capacity planning, node scaling, and testing that multi-region failover actually works
Drive the IaC foundation in Terraform/Atlantis and champion infrastructure-as-code as a core engineering standard
Partner with Engineering Managers to set SLOs for their services, establish error budgets, and help teams build the habits to operate what they ship; the goal is for product teams to own their services, not to have SRE own everything on their behalf
Maintain proactive reliability work: capacity planning, failure mode analysis, runbook quality, and chaos engineering exercises; run reliability reviews before major launches and organize failure mode exercises with product teams
Drive blameless postmortem practice, ensuring every significant incident produces systemic improvements with clear ownership and closure
Run the Tier 1 on-call rotation: scheduling for primary and secondary engineers, coordination with the 3rd-party NOC, and keeping incident escalation processes smooth and manageable
Lead incident command on Sev-1s, escalate when needed, and keep engineering leadership informed throughout
Lead and grow a high-impact team by mentoring engineers, owning headcount, and thinking ahead about what the team needs as the function grows
Shape the team’s long-term AI strategy for infrastructure and operations by identifying opportunities for AI-driven automation and insight generation, evaluating tooling and workflows, and operationalizing best practices for scalable team-wide usage
Own reserved instance strategy, the team's AWS cost footprint, and error budgets and SLOs across production services, and communicate that picture clearly to engineering and product leadership
Work alongside Platform SRE on bigger infrastructure projects: Gateway API adoption, cross-region architecture, security changes