Sr. Site Reliability Engineer

Pinterest
Toronto, ON
Hybrid

About The Position

The Site Reliability Engineering organization at Pinterest is accountable for ensuring overall Pinterest availability as well as enhancing Engineering teams’ capability to design, build and operate robust systems at scale. We are hiring a Sr. SRE to join our Compute SRE team. This team is responsible for ensuring that all compute workloads run smoothly on Pinterest. We're building the future on Kubernetes and our job is to connect it with what Pinterest needs. Pinterest’s applications and infrastructure handle billions of monthly page views and petabytes of data as Pinterest continues to grow and scale. As a Pinterest SRE, you will design and build systems, platforms, tools, frameworks and methodologies to assure the reliability of our large-scale distributed systems.

Requirements

  • Strong knowledge of Kubernetes (especially EKS), including deployment patterns, rollout safety, and core debugging workflows
  • 4+ years of experience with programming languages (Python or Golang preferred)
  • Strong experience managing projects and initiatives end-to-end
  • Hands-on experience with AI-assisted development tools such as Cursor, GitHub Copilot or Claude for code generation, debugging, and documentation
  • Demonstrated ability to write effective prompts to get high-quality, reliable outputs from LLMs
  • Demonstrated ability to use AI to improve the speed and quality of relevant outputs in your day-to-day workflow
  • Strong track record of critical evaluation and verification of AI-assisted work (e.g., testing, source-checking, data validation, peer review)
  • High integrity and ownership: you protect sensitive data, avoid over-reliance on AI, and remain accountable for final decisions and deliverables
  • Experience with technologies such as Terraform, Buildkite, and/or ArgoCD is required
  • Bachelor’s or Master’s degree in a relevant field such as Computer Science, or equivalent experience

Responsibilities

  • Tackle project challenges on EKS, such as implementing Karpenter. This work affects how every developer codes, tests, and improves their work
  • Collaborate across various teams to drive projects forward using open-source tools
  • Build a deep understanding of how Pinterest’s systems behave, scale, interact and fail, and use that insight to identify risks and opportunities for remediation
  • Build tools and automation to eliminate toil and reduce operational overhead. Create frameworks, processes and best practices to be used across Pinterest Engineering
  • Build meaningful, insightful and actionable SLIs
  • Automate critical portions of Pinterest’s engineering processes to minimize risk and maximize the speed of innovation
  • Manage capacity and performance to help scale our infrastructure on both public and private clouds around the world
  • Use AI to analyze incidents, operational signals, and system behaviors to help identify patterns, generate plans, and propose remediation approaches
  • Leverage AI to speed development of runbooks, automation workflows, and reliability tooling by drafting, iterating on, and refining approaches