Staff Site Reliability Engineer

Tabs • New York, NY

50d • Onsite

About The Position

We’re looking for a Staff Site Reliability Engineer to lead the evolution of Tabs’ platform as we scale. In this role, you’ll operate as a senior individual contributor, partnering closely with engineering and product teams to design, build, and operate systems that are reliable, observable, and easy to develop on. You’ll own our infrastructure direction, shape how we ship software, and set the standard for operational excellence across the company. This is a high-impact role for someone who enjoys solving complex systems problems, influencing architecture, and raising the reliability bar without becoming a gatekeeper.

Requirements

10+ years in SRE, infrastructure, or backend engineering roles
Strong software engineering experience in one or more modern languages
Expertise operating distributed systems in production at scale
Deep experience with AWS, observability tooling, and CI/CD systems
Comfortable navigating ambiguity and setting direction in a fast-moving environment
You’ve run production systems on AWS and can lead platform-level change
You think in systems: risk, rollback strategy, blast radius, and feedback loops
You treat CI/CD and environments as products that should be fast, reliable, and self-serve
You influence through trust and clarity rather than control
You balance pragmatism with long-term system health
You value learning from failure and improving processes over assigning blame
You communicate clearly and work well across teams

Responsibilities

AWS infrastructure direction and platform evolution, including the migration from ECS/Fargate toward a more modern, scalable runtime
CI/CD systems with a strong emphasis on developer experience, safety, and automation (GitHub Actions today; maturing CD tomorrow)
Ephemeral environments and preview deploys to speed iteration and increase confidence in changes
Observability standards across metrics, logs, and tracing, including alert hygiene, dashboards, and SLO development
Incident response, postmortems, and the reliability culture that surrounds them
Define and evolve reliability standards, SLIs, SLOs, and error budgets
Improve observability, alerting, and incident processes across services
Lead high-severity incidents and drive clear, actionable follow-ups
Partner with engineering teams to design resilient, scalable systems
Build automation to reduce toil and lower operational risk
Mentor engineers and influence best practices across teams

Benefits

Competitive compensation and equity
Unlimited PTO
Up to 100% employer-covered monthly healthcare premiums (medical, dental, vision)
Lunch provided via Sharebite, plus dinner on days you stay late at the office
Parental leave up to 12 weeks
Tax-free commuter and parking benefits
Voluntary insurance (Life, Hospital, Critical Illness, Accident)
Employee Assistance Program (Rightway)
401(k)
