Site Reliability Engineer (SRE) — Monstro US

Monstro
New York, NY
$142,000 - $214,700
Onsite

About The Position

Monstro is building a secure, multi-tenant platform on Google Cloud, and we’re hiring a Site Reliability Engineer to own the reliability and observability of that platform end-to-end. This is a hands-on role for someone who wants to do real SRE work - not a rebrand of L1 support. You’ll write the dashboards, define the SLOs, build the automation that kills toil, and take your turn on the on-call rotation that proves it all works. When something breaks at 2 AM, you’re the person who keeps it running; when nothing’s breaking, you’re the person making sure the next break is smaller, shorter, or doesn’t happen at all.

Requirements

  • Solid production experience on GCP (or comparable AWS/Azure depth with willingness to ramp on GCP fast)
  • Comfortable on-call: you’ve run incidents, written postmortems, and shipped the action items
  • Strong observability fundamentals: SLOs, log-based metrics, alert hygiene, dashboard discipline
  • Working knowledge of Kubernetes, API gateways, identity systems, and at least one IaC tool
  • Scripting / coding fluency (Python, Go, Bash) for automation and tooling
  • Good written communication — handoffs, postmortems, and runbooks are part of the job
  • Bias toward fixing the system, not the symptoms

Nice To Haves

  • Apigee or another enterprise API gateway in production
  • BigQuery for log analytics or audit
  • Experience standing up observability from scratch, not just maintaining inherited dashboards
  • Experience in SOC 2 or similar compliance environments

Responsibilities

  • Define and maintain SLOs and SLIs for our tier-1 services: API gateway, application services, identity, and edge availability
  • Build canonical dashboards and alerts in Google Cloud Monitoring, backed by structured logs and BigQuery log analytics
  • Tune alert routing so every page is actionable — kill the rest
  • Instrument services for distributed tracing and structured logging; push back on services that ship without it
  • Own error budgets and use them to prioritize reliability work over feature work when a budget is burned
  • Reduce toil: automate the top recurring page from the previous quarter
  • Maintain runbooks so every page maps to a runbook within one cycle of its first occurrence
  • Act as first responder for production alerts across monitoring, API gateway, edge defense, and CI
  • Triage severity, run the incident bridge, drive mitigation (revision rollback, traffic shift, scaling, edge block, credential rotation)
  • Own internal and external incident comms during your shift
  • Drive postmortems to closure with action items tracked as audit evidence
  • Write clean handoffs at the end of each shift

Benefits

  • Competitive salary
  • Equity
  • Robust benefits package
  • Paid health coverage
  • Vision coverage
  • Dental coverage
  • Disability coverage