Senior Site Reliability Engineer - Hiring Sprint

Airbyte • San Francisco, CA

1d • $196,000 - $255,000 • Hybrid

About The Position

Airbyte is the data and action layer for AI agents, providing fast, accurate, authenticated access to business data across hundreds of sources. The company has raised $181M and is building context infrastructure for production-grade agents. This role is part of an Engineering Hiring Sprint, which aims to accelerate hiring. The Senior Site Reliability Engineer will join the Data Replication team and be responsible for the infrastructure and reliability of a platform that runs over 3 million sync jobs a week. The role involves building and maintaining infrastructure, setting reliability standards, reducing incidents, and improving tooling. Engineers are expected to actively use AI as a force multiplier for tasks such as automating toil and augmenting incident response. The company values trust, directness, and craftsmanship.

Requirements

7+ years in infrastructure, platform engineering, SRE, or DevOps.
Hands-on ownership of Kubernetes, Helm, and Terraform in production environments.
Deep experience with observability stacks (Prometheus, Grafana, Datadog) and on-call operations.
Experience with CI/CD pipeline ownership and developer tooling.
Ability and willingness to read backend code to understand how systems break and to instrument them correctly.
Fluency with AI tools (LLMs and agentic frameworks) to automate work, debug faster, and reduce toil.
A startup-ready mindset: comfortable with ambiguity, moving fast, and owning problems end-to-end.

Nice To Haves

Experience with data pipelines, replication systems, or ETL/ELT platforms.
Experience with control plane / data plane architectures or internal developer platforms.
Experience with Airbyte, CDKs, or connector-based architectures.

Responsibilities

Own the infrastructure underpinning the Data Replication platform - Kubernetes clusters, CI/CD pipelines, secrets management, networking, and cloud resource configuration across AWS and GCP.
Partner with product engineers to reliably integrate product features with infrastructure.
Maintain and enhance observability, alerting, and anomaly detection with an eye towards LLM automation.
Maintain and enhance AI-augmented release and internal tooling (canary deployments, progressive rollouts, automated release qualification, and rollback automation), with an eye towards LLM automation.
Set the infrastructure bar for the team - build self-serve tooling, write runbooks, and coach engineers to own more of their stack.