Site Reliability Engineer

Sitetracker

About The Position

This is your chance to build a reliability practice from the ground up and establish the engineering standards, including SLOs, error budgets, and observability, that will protect our platform as we scale for enterprise customers and expand our AI capabilities. You’ll have the autonomy to set the strategy and the trust to execute it, ensuring that our AI workloads (Evals, RAG, and LLM processing) meet the highest reliability standards. If you are a proactive problem solver who treats toil as an engineering challenge and wants the agency to decide which technologies to adopt and when, you will find this to be a career-defining role.

As a Staff or Senior Staff SRE, you’ll hit the ground running by partnering with the engineers currently managing reliability to transition the organization from reactive firefighting to a proactive, disciplined reliability practice. You will lead the deliberate evolution of our infrastructure, recognizing the inflection point for new tooling and leading migrations away from manual scripts and templates only once the new tooling has earned its keep. Whether you are architecting incident response structures or solving novel reliability problems for AI agents, your work will act as a multiplier that empowers the entire engineering team.

By bringing a consulting mindset to every challenge, you’ll propose technical trade-offs based on evidence rather than reflex, ensuring our roadmap for multi-region or service mesh adoption is built for tomorrow. You won’t just be handed tasks; you will own the strategy for production-readiness and deploy safety, building the organizational trust needed to make reliability a core differentiator of our product.

Requirements

Deep SRE Expertise
Define SLIs and SLOs for critical user journeys and use them to drive proactive engineering decisions.
Run live production incident response as an Incident Commander and lead blameless postmortems that result in shipped follow-up actions.
Build observability that tells a story: dashboards that explain a system's behavior to someone seeing it for the first time, paired with actionable alerts.
Take an organization from reactive firefighting to a working reliability practice with measurable improvements in paging volume.
Design error-budget policies and use them to make data-driven trade-offs between shipping features and maintaining reliability.
Deep Technical Expertise in AWS
Design and operate services on AWS competently: VPC, IAM, compute (ECS/EC2/Lambda), managed data services, and load balancing.
Navigate our current setup of CloudFormation and bash scripts via GitHub Actions effectively without reaching for Terraform reflexively.
Debug production AWS issues at the network and IAM level without escalating to AWS support as a first step.
Design and roll out production workloads across multiple regions and countries while accounting for data residency and regional failure modes.
Lead high-stakes tooling migrations into established environments and manage the long-term consequences of those architectural choices.
Impact, Leadership & Team Enablement
Mentor engineers through pair debugging, postmortem coaching, and runbook reviews to leave the team more capable.
Define alerts for impactful metrics and write the clear, actionable runbooks that go with them.
Work with engineering teams to gather requirements for new infrastructure and conduct constructive production-readiness reviews.
Teach teams how to build their own observability dashboards, raising the technical floor across the entire organization.
Use AI tooling aggressively, including coding agents and log analysis, to accelerate the delivery of impactful changes.
Communication & Influence
Communicate scheduled downtime and infrastructure changes to stakeholders proactively with clear timing and expected impact.
Write postmortems that both engineers and non-engineers can read, understand, and learn from.
Act as the recognized Subject Matter Expert for AWS-related questions across the engineering organization.
Influence product and engineering roadmap decisions by using data and evidence rather than opinion when reliability is a factor.
Build organizational trust so that teams seek out the SRE practice early in the development cycle to make their work better.

Nice To Haves

Fully onboard and partner with the engineers currently managing reliability to review and revise the existing operational plan.
Operationalize high-leverage items to transition the team out of reactive firefighting and into a more stable, proactive state.
Establish a baseline for current system behavior by identifying the most critical user journeys that require immediate SLI/SLO definitions.
Independently drive the revised reliability plan, ensuring SLIs/SLOs are in place and actively used to guide engineering decisions.
Standardize the incident response structure, including severity definitions, Incident Commander roles, and a cadence for blameless postmortems.
Measurably reduce paging volume and ensure that every alert that pages an engineer is backed by a clear, effective runbook.
Establish a mature reliability practice where production-readiness reviews and error-budget conversations are default parts of the development lifecycle.
Define a clear, evidence-based tooling roadmap for the next phase of our evolution, such as Terraform, service mesh, or multi-region expansion.
Serve as an organizational multiplier, having built the observability and culture necessary for other engineers to reason about reliability without constant supervision.