Senior Site Reliability Engineer – Platform

Quizlet•San Francisco, CA

14d•Onsite

About The Position

At Quizlet, our mission is to help every learner achieve their outcomes in the most effective and delightful way. Our $1B+ learning platform serves tens of millions of students every month, including two-thirds of U.S. high schoolers and half of U.S. college students, powering over 2 billion learning interactions monthly. We blend cognitive science with machine learning to personalize and enhance the learning experience for students, professionals, and lifelong learners alike. We’re energized by the potential to power more learners through multiple approaches and various tools.

Let’s Build the Future of Learning

Join us to design and deliver AI-powered learning tools that scale across the world and unlock human potential. As a Senior Site Reliability Engineer, you’ll design and build the automation, observability, and systems architecture that enable Quizlet to scale reliably for the next generation of AI-powered learning. You’ll engineer software, tools, and processes that improve service performance, reduce operational toil, and ensure our platform meets strict SLOs for global learners.

We’re happy to share that this is an onsite position in our San Francisco office. To help foster team collaboration, we require that employees be in the office a minimum of three days per week: Monday, Wednesday, and Thursday, and as needed by your manager or the company. We believe that this working environment facilitates increased work efficiency and team partnership, and supports growth as an employee and organization.

Requirements

5+ years of experience in SRE, infrastructure, or systems software engineering
Strong proficiency in Go and/or Python, with experience automating operational workflows
Deep understanding of Kubernetes (GKE), Istio, and distributed systems
Hands-on experience with CI/CD systems (GitHub Actions, CircleCI, ArgoCD) and Terraform
Expertise in Datadog, incident response, and observability-driven engineering
Solid foundation in Linux systems, networking, and GCP or similar cloud environments
Proven ability to improve reliability, reduce MTTR, and lead by example

Responsibilities

Develop and maintain automation and tooling that ensures 99.95% uptime under peak load
Build self-healing and auto-remediation systems for our Kubernetes clusters (GKE) and service mesh (Istio)
Optimize our CI/CD toolchain (GitHub Actions, CircleCI, ArgoCD) to improve deployment velocity and reliability
Design and deploy deep observability and diagnostics in Datadog, leading post-incident reviews with Jeli
Conduct capacity planning and performance tuning across databases (Spanner, PlanetScale, BigQuery)
Collaborate with product engineering to define SLOs and ensure new services are resilient by design

Benefits

Collaborate with your manager and team to create a healthy work-life balance
20 vacation days that we expect you to take!
Competitive health, dental, and vision insurance (100% employee and 75% dependent PPO, Dental, VSP Choice)
Employer-sponsored 401(k) plan with company match
Access to LinkedIn Learning and other resources to support professional growth
Paid Family Leave, FSA, HSA, Commuter benefits, and Wellness benefits
