AI DevOps & Reliability Engineer

Branch Metrics•Vancouver, BC

CA$123,000 - CA$160,000•Remote

About The Position

At Branch, we power every touchpoint with links that work and insights that prove it. From click to conversion, we make growth measurable. Our unparalleled attribution, backed by AI-enhanced linking, is trusted to deliver seamless experiences that increase ROI, decrease wasted spend, and eliminate siloed attribution.

We bring the same rigor to how we build our team, by empowering our people to move fast, own outcomes, and build something that matters. We take pride in making meaningful investments in our team’s health, wealth, and growth so individuals can thrive as we scale. Our culture values smart, humble, and collaborative teammates who take accountability and drive results in an environment where their work truly moves the business forward. We are innovative, scaling with purpose, and led by seasoned leaders who know how to build enduring companies.

Trusted by brands like Instacart, Western Union, NBCUniversal, ZocDoc, and Sephora, we’re big enough to matter, small enough for you to make a real impact. If you’re excited by the grit of building, rapid learning, and shaping the future of customer growth, you’ll find your place here.

About The Group

We're hiring an AI DevOps & Reliability Engineer to own how software ships and runs at Branch. The role has two areas: half central platform and standards work, half embedded with an engineering team. Centrally, you'll build and operate the delivery platform (CI/CD pipelines, deployment automation, environments) so teams can release safely, frequently, and on demand. Embedded, you'll work hands-on with an engineering team day-to-day on their infrastructure, deployment, and operational practices, mentoring them and building their capability over time.

You'll also lead the adoption of AI in DevOps and SRE work at Branch. Bringing modern AI tooling (Claude Code, agentic workflows) into runbook generation, alerting, incident response, and operational tooling is a core part of this role, not a side project. It's a strategic direction we're committed to.

As a lead, you'll work directly with engineering leadership to shape the operations and delivery roadmap across multiple milestones.

Requirements

Hands-on experience adopting AI into DevOps and SRE practices (Claude Code, Cursor, agents, or similar) to improve automation, debugging, and operational efficiency.
7+ years in DevOps, platform, infrastructure, or related engineering roles, ideally in fast-scaling environments.
Strong hands-on Kubernetes and AWS experience.
Deep IaC experience (Terraform and/or CloudFormation) and the ability to set IaC standards for other teams.
Proven CI/CD architecture experience: pipelines, quality gates, release automation.
GitOps experience with Argo CD (or Flux) for Kubernetes delivery.
Hands-on experience operating streaming infrastructure (Kafka) in production.
Experience managing SQL and NoSQL datastores at high volume: performance, scaling, operational health.
Solid scripting/automation skills (Python, Bash, or similar).
Working knowledge of observability stacks: Prometheus, Grafana, PagerDuty (Loki / Alertmanager a plus).
Familiarity with on-call, incident response, SLI/SLO definition, runbooks, and the operational practices that support them.
Strong collaborator and communicator. Comfortable working across teams, mentoring engineers, and driving alignment without authority.

Nice To Haves

Progressive delivery (canary, blue/green) and feature-flag-driven release experience.
Cost / efficiency awareness in cloud infrastructure.
Broader data / streaming ecosystem exposure (Spark, schema management, CDC, etc.).

Responsibilities

Design and expand deployment automation, advancing the org toward on-demand and continuous production releases.
Establish release practices and standards: progressive delivery, rollback, release tracking, and a deployment inventory teams can trust.
Extend automation deeper into production paths, reducing manual steps and release toil.
Enable verification through automation: quality gates as code, supported by build engineering.
Own CI/CD standards across teams: quality gates, automated checks, guardrails that catch problems before production.
Build pipeline tooling that makes the safe path the easy path for engineers.
Design and build out dev, staging, and on-demand (ephemeral) environments that mirror production and spin up on request.
Treat environment provisioning as a product: fast, reproducible, self-service.
Bring AI tooling into operations: automated runbook generation, intelligent alerting, AI-assisted incident response, operational tooling.
Help build an org-wide, AI-augmented ops practice and share patterns across teams.
Champion Infrastructure as Code (Terraform / CloudFormation) for provisioning, configuration, and lifecycle management.
Drive GitOps-based delivery with Argo CD for secure, repeatable, scalable deployments across Kubernetes.
Bring a strong reliability foundation: alerting practices, on-call, runbooks, SLI/SLO definition, incident response.
Partner with engineering teams on the operational practices that keep their services healthy at high volume.
Operate and tune high-volume data infrastructure: streaming pipelines (Kafka) and SQL/NoSQL datastores under heavy production load.
Strengthen team-level runbooks, operational readiness, and production hygiene; feed improvements back into the platform.
Embed with an assigned engineering team day-to-day, working hands-on with them on infrastructure, deployment, and reliability work.
Mentor team engineers on operational best practices, observability, and reliability.
Help build the team's capability over time so good practices stick.
Stand up DORA metrics (lead time, deployment frequency, change failure rate, MTTR) and use them to target real improvements.
Make delivery and reliability health visible to teams and leadership.
Work with engineering leadership on the operations and delivery roadmap.
Drive cross-team adoption of standards and tooling through collaboration and influence.