Sr. Site Reliability Engineer

Pinterest•Toronto, ON

62d•Hybrid

About The Position

The Site Reliability Engineering organization at Pinterest is accountable for ensuring overall Pinterest availability as well as enhancing Engineering teams’ capability to design, build and operate robust systems at scale. We are hiring a Sr. SRE to join our Compute SRE team. This team is responsible for ensuring that all compute workloads run smoothly on Pinterest. We're building the future on Kubernetes, and our job is to connect it with what Pinterest needs. Pinterest’s applications and infrastructure handle billions of monthly page views and petabytes of data as Pinterest continues to grow and scale. As a Pinterest SRE, you will design and build systems, platforms, tools, frameworks and methodologies to assure the reliability of our large-scale distributed systems.

Requirements

Strong knowledge of Kubernetes (especially EKS), including deployment patterns, rollout safety, and core debugging workflows
4+ years of experience with programming languages (Python or Golang preferred)
Strong experience managing projects and initiatives end-to-end
Hands-on experience with AI-assisted development tools such as Cursor, GitHub Copilot or Claude for code generation, debugging, and documentation
Demonstrated ability to write effective prompts to get high-quality, reliable outputs from LLMs
Demonstrated ability to use AI to improve the speed and quality of your day-to-day work
Strong track record of critical evaluation and verification of AI-assisted work (e.g., testing, source-checking, data validation, peer review)
High integrity and ownership: you protect sensitive data, avoid over-reliance on AI, and remain accountable for final decisions and deliverables
Experience with technologies such as Terraform, Buildkite, and/or ArgoCD is required
Bachelor’s or Master’s degree in a relevant field such as Computer Science, or equivalent experience

Responsibilities

Tackle project challenges on EKS, such as implementing Karpenter. This work affects how every developer codes, tests, and improves their work
Collaborate across various teams to drive projects forward using open-source tools
Build a deep understanding of how Pinterest’s systems behave, scale, interact and fail, and use that insight to identify risks and opportunities for remediation
Build tools and automation to eliminate toil and reduce operational overhead. Create frameworks, processes and best practices to be used across Pinterest Engineering
Build meaningful, insightful and actionable SLIs
Automate critical portions of Pinterest’s engineering processes, to minimize risk and maximize the speed of innovation
Manage capacity and performance to help scale our infrastructure on both public and private clouds around the world
Use AI to analyze incidents, operational signals, and system behaviors in order to identify patterns, generate plans, and propose remediation approaches
Leverage AI to speed up the development of runbooks, automation workflows, and reliability tooling by drafting, iterating on, and refining approaches
