Senior Site Reliability Engineer (SRE)

Finite State
$215,000 - $250,000 | Remote

About The Position

We are seeking a Senior Site Reliability Engineer (SRE) / Infrastructure Engineering leader to define, architect, and drive a modern observability and reliability strategy for an AI-first development organization. This is a highly impactful technical leadership role responsible for establishing best-in-class operational practices, reliability standards, and AI-enabled infrastructure automation across our product ecosystem. This individual will bring deep experience in reliability engineering, distributed systems, and production operations—along with a forward-thinking mindset around AI-assisted development and infrastructure-as-code. If you are passionate about building resilient systems, defining SLOs that actually matter, and leveraging AI tooling to accelerate operational excellence, this role is for you.

Requirements

  • 10+ years of experience in Site Reliability Engineering, Infrastructure Engineering, or Production Engineering.
  • Proven experience defining and implementing SLOs, SLAs, SLIs, and error budget frameworks at scale.
  • Deep experience building and managing on-call rotations and incident management processes.
  • Strong background in distributed systems and cloud-native architectures.
  • Hands-on experience with:
      • Honeycomb
      • Grafana
      • AWS
      • Vercel
      • Supabase
  • Strong experience with observability instrumentation and telemetry design.
  • Infrastructure-as-Code experience (e.g., Terraform, Pulumi, or similar).
  • Experience designing resilient CI/CD pipelines.
  • Deep understanding of high-availability, scalability, and performance engineering principles.
  • Demonstrated experience leveraging AI tools (Cursor, Claude, Codex, etc.) in development or infrastructure workflows.
  • Experience using AI-assisted tooling to generate, validate, or optimize infrastructure configurations.
  • Strong interest in building AI-native operational practices.
  • Ability to operate as both strategic architect and hands-on implementer.
  • Strong written and verbal communication skills.
  • Experience influencing cross-functional teams.
  • Comfort working in fast-paced, high-growth environments.

Nice To Haves

  • Experience supporting AI/ML workloads in production.
  • Experience building internal developer platforms (IDP).
  • Experience with cost observability and FinOps practices.
  • Experience scaling observability in high-growth SaaS environments.

Responsibilities

  • Leverage AI tools and agentic processes to drive observability, quality, responsiveness, and operational clarity.
  • Design modern telemetry pipelines (metrics, logs, traces, events) for distributed systems and AI-driven workloads.
  • Define and implement a comprehensive observability framework across applications and infrastructure.
  • Establish and operationalize meaningful SLIs, SLOs, and SLAs aligned with business objectives.
  • Lead the adoption and optimization of observability tooling including Honeycomb, Grafana, and related telemetry platforms.
  • Drive best practices in error budgeting, alert design, and production health monitoring.
  • Define and evolve incident management processes, including:
      • On-call structures and escalation models
      • Postmortems and blameless retrospectives
      • Runbooks and operational playbooks
  • Improve system reliability, performance, scalability, and cost efficiency.
  • Establish operational KPIs and reliability dashboards for engineering and leadership visibility.
  • Lead reliability reviews for new architecture and product initiatives.
  • Architect and implement scalable cloud infrastructure primarily within AWS.
  • Work closely with modern application platforms such as Vercel and Supabase.
  • Implement and improve Infrastructure-as-Code practices.
  • Leverage AI-assisted tooling to accelerate infrastructure design, validation, and automation.
  • Ensure production-grade security, compliance, and resilience standards.
  • Champion the use of AI tools to:
      • Accelerate infrastructure provisioning
      • Improve operational workflows
      • Enhance observability signal quality
      • Automate incident response and remediation
  • Partner with AI-focused product teams to ensure observability supports model performance, experimentation, and reliability.
  • Serve as a senior technical authority for reliability and infrastructure decisions.
  • Mentor engineers on production best practices.
  • Influence architectural decisions to improve system resilience and maintainability.
  • Drive a culture of reliability, accountability, and continuous improvement.

Benefits

  • Equity
  • Benefits