Staff Site Reliability Engineer, Release Engineering

Plaid•New York, NY

$207,600 - $273,600

About The Position

Plaid's Infrastructure team builds the platforms and tooling that help engineering teams develop, deploy, and operate production systems safely. Release Engineering owns the path from merge to production, including Plaid's zero-touch deployment system, progressive rollouts, metric-gated analysis, and automatic rollback. Our goal is to make safe shipping the default for every product team.

As a Staff Site Reliability Engineer on Release Engineering, you'll define and scale Plaid's reliability practices across product engineering. You'll architect our SLO and error-budget programs, drive the adoption of progressive delivery, and ensure new products are production-ready. By partnering across product and platform teams, you'll translate complex production needs into intuitive, self-service tooling. This is a hands-on technical leadership role where you'll shape the future of our deployment systems, ensuring they remain fast and safe even as AI-assisted development increases code velocity.

Requirements

Over 8 years of professional experience in backend systems, SRE, or platform engineering roles.
Proven track record of designing reliability programs—such as service maturity models or SLI frameworks—that achieved cross-team adoption.
Direct experience building or operating canary rollout systems, metric-gated analysis, or automated rollback infrastructure.
Technical proficiency in software development, with a preference for Go or similar systems languages.
Ability to drive organizational change and influence engineering culture without formal authority.
Sound technical judgment in high-stakes production scenarios, balancing user impact with developer velocity.

Nice To Haves

Prior exposure to Kubernetes, service mesh technologies, Prometheus, or ArgoCD is considered a strong asset.

Responsibilities

Lead the expansion of reliability standards across product engineering, converting foundational infrastructure into lasting operational habits and tooling.
Architect and manage the SLO and error-budget framework, empowering teams to use reliability data when making product and release decisions.
Promote widespread use of progressive delivery and automated safety gates, ensuring high velocity without compromising production stability.
Guide emerging product teams toward production readiness through expertise in observability, incident response, and scalable deployment health.
Collaborate with SRE, Platform, and Infrastructure teams to transform complex production requirements into intuitive, self-service platform features.
Direct the response to critical incidents and ensure the resulting post-mortem actions yield permanent improvements to the platform.
Prepare for an AI-driven development landscape by scaling our safety nets to handle an increased volume and frequency of code changes.