Staff Site Reliability Engineer

Tabs · New York, NY
Posted 1d ago · Onsite

About The Position

We’re looking for a Staff Site Reliability Engineer to lead the evolution of Tabs’ platform as we scale. In this role, you’ll operate as a senior individual contributor, partnering closely with engineering and product teams to design, build, and operate systems that are reliable, observable, and easy to develop on. You’ll own our infrastructure direction, shape how we ship software, and set the standard for operational excellence across the company. This is a high-impact role for someone who enjoys solving complex systems problems, influencing architecture, and raising the reliability bar without becoming a gatekeeper.

Requirements

  • 10+ years in SRE, infrastructure, or backend engineering roles
  • Strong software engineering experience in one or more modern languages
  • Expertise operating distributed systems in production at scale
  • Deep experience with AWS, observability tooling, and CI/CD systems
  • Comfortable navigating ambiguity and setting direction in a fast-moving environment
  • You’ve run production systems on AWS and can lead platform-level change
  • You think in systems: risk, rollback strategy, blast radius, and feedback loops
  • You treat CI/CD and environments as products that should be fast, reliable, and self-serve
  • You influence through trust and clarity rather than control
  • You balance pragmatism with long-term system health
  • You value learning from failure and improving processes over assigning blame
  • You communicate clearly and work well across teams

Responsibilities

  • Own AWS infrastructure direction and platform evolution, including the migration from ECS/Fargate toward a more modern, scalable runtime
  • Own CI/CD systems, with a strong emphasis on developer experience, safety, and automation (GitHub Actions today; maturing CD tomorrow)
  • Provide ephemeral environments and preview deploys to speed iteration and increase confidence in changes
  • Set observability standards across metrics, logs, and tracing, including alert hygiene, dashboards, and SLO development
  • Own incident response, postmortems, and the reliability culture that surrounds them
  • Define and evolve reliability standards, SLIs, SLOs, and error budgets
  • Improve observability, alerting, and incident processes across services
  • Lead high-severity incidents and drive clear, actionable follow-ups
  • Partner with engineering teams to design resilient, scalable systems
  • Build automation to reduce toil and lower operational risk
  • Mentor engineers and influence best practices across teams

Benefits

  • Competitive compensation and equity
  • Unlimited PTO
  • Up to 100% employer-covered monthly healthcare premiums (medical, dental, vision)
  • Lunch provided via Sharebite, plus dinner on late office days
  • Parental leave up to 12 weeks
  • Tax-free commuter and parking benefits
  • Voluntary insurance (life, hospital, critical illness, accident)
  • Employee Assistance Program (Rightway)
  • 401(k)