Senior Site Reliability Engineer (SRE)

Finite State

$215,000 - $250,000 • Remote

About The Position

We are seeking a Senior Site Reliability Engineer (SRE) / Infrastructure Engineering leader to define, architect, and drive a modern observability and reliability strategy for an AI-first development organization. This is a high-impact technical leadership role responsible for establishing best-in-class operational practices, reliability standards, and AI-enabled infrastructure automation across our product ecosystem. This individual will bring deep experience in reliability engineering, distributed systems, and production operations, along with a forward-thinking mindset around AI-assisted development and infrastructure-as-code. If you are passionate about building resilient systems, defining SLOs that actually matter, and leveraging AI tooling to accelerate operational excellence, this role is for you.

Requirements

10+ years of experience in Site Reliability Engineering, Infrastructure Engineering, or Production Engineering.
Proven experience defining and implementing SLOs, SLAs, SLIs, and error budget frameworks at scale.
Deep experience building and managing on-call rotations and incident management processes.
Strong background in distributed systems and cloud-native architectures.
Hands-on experience with Honeycomb, Grafana, AWS, Vercel, and Supabase.
Strong experience with observability instrumentation and telemetry design.
Infrastructure-as-Code experience (e.g., Terraform or Pulumi).
Experience designing resilient CI/CD pipelines.
Deep understanding of high-availability, scalability, and performance engineering principles.
Demonstrated experience leveraging AI tools (Cursor, Claude, Codex, etc.) in development or infrastructure workflows.
Experience using AI-assisted tooling to generate, validate, or optimize infrastructure configurations.
Strong interest in building AI-native operational practices.
Ability to operate as both strategic architect and hands-on implementer.
Strong written and verbal communication skills.
Experience influencing cross-functional teams.
Comfort working in fast-paced, high-growth environments.

Nice To Haves

Experience supporting AI/ML workloads in production.
Experience building internal developer platforms (IDPs).
Experience with cost observability and FinOps practices.
Experience scaling observability in high-growth SaaS environments.

Responsibilities

Leverage AI tools and agentic processes to drive observability, quality, responsiveness, and operational clarity.
Design modern telemetry pipelines (metrics, logs, traces, events) for distributed systems and AI-driven workloads.
Define and implement a comprehensive observability framework across applications and infrastructure.
Establish and operationalize meaningful SLIs, SLOs, and SLAs aligned with business objectives.
Lead the adoption and optimization of observability tooling including Honeycomb, Grafana, and related telemetry platforms.
Drive best practices in error budgeting, alert design, and production health monitoring.
Define and evolve incident management processes, including on-call structures and escalation models, postmortems and blameless retrospectives, and runbooks and operational playbooks.
Improve system reliability, performance, scalability, and cost efficiency.
Establish operational KPIs and reliability dashboards for engineering and leadership visibility.
Lead reliability reviews for new architecture and product initiatives.
Architect and implement scalable cloud infrastructure primarily within AWS.
Work closely with modern application platforms such as Vercel and Supabase.
Implement and improve Infrastructure-as-Code practices.
Leverage AI-assisted tooling to accelerate infrastructure design, validation, and automation.
Ensure production-grade security, compliance, and resilience standards.
Champion the use of AI tools to accelerate infrastructure provisioning, improve operational workflows, enhance observability signal quality, and automate incident response and remediation.
Partner with AI-focused product teams to ensure observability supports model performance, experimentation, and reliability.
Serve as a senior technical authority for reliability and infrastructure decisions.
Mentor engineers on production best practices.
Influence architectural decisions to improve system resilience and maintainability.
Drive a culture of reliability, accountability, and continuous improvement.