Staff Site Reliability Engineer - Platform

Quizlet•San Francisco, CA

63d•Onsite

About The Position

At Quizlet, our mission is to help every learner achieve their outcomes in the most effective and delightful way. Our $1B+ learning platform serves tens of millions of students every month, including two-thirds of U.S. high schoolers and half of U.S. college students, powering over 2 billion learning interactions monthly. We blend cognitive science with machine learning to personalize and enhance the learning experience for students, professionals, and lifelong learners alike. We’re energized by the potential to power more learners through multiple approaches and various tools.

Let’s Build the Future of Learning

Join us to design and deliver AI-powered learning tools that scale across the world and unlock human potential.

About the Role

As a Staff Site Reliability Engineer, you’ll lead reliability engineering across Quizlet’s platform, designing automation, scaling systems, and ensuring that our infrastructure can support rapid innovation in AI-powered learning. You’ll drive the architectural direction for resilience, observability, and performance while mentoring other engineers and influencing platform-wide standards.

We’re happy to share that this is an onsite position in our San Francisco office. To help foster team collaboration, we require that employees be in the office a minimum of three days per week: Monday, Wednesday, and Thursday, and as needed by your manager or the company. We believe that this working environment facilitates increased work efficiency and team partnership, and supports growth as an employee and organization.

Requirements

8+ years of experience in SRE, systems, or infrastructure engineering
Expertise in Kubernetes (GKE), Terraform, and CI/CD pipelines (ArgoCD, GitHub Actions, CircleCI)
Deep programming skills in Go and/or Python for infrastructure automation
Strong experience in Datadog, system monitoring, and distributed tracing
Familiarity with GCP services, Linux internals, and large-scale networking
Proven experience leading cross-team reliability initiatives and architectural improvements

Responsibilities

Lead the design and implementation of self-healing, auto-scaling infrastructure across our Kubernetes and Istio environments
Architect and implement CI/CD reliability improvements that reduce MTTR and deployment risk
Partner with teams to define and enforce SLOs and operational excellence standards
Build systems and tools that enable proactive reliability and capacity management
Drive incident analysis and postmortems using Datadog and Jeli to identify architectural improvements
Mentor engineers and establish best practices for automation, observability, and scaling

Benefits

Collaborate with your manager and team to create a healthy work-life balance
20 vacation days that we expect you to take!
Competitive health, dental, and vision insurance (100% employee and 75% dependent PPO, Dental, VSP Choice)
Employer-sponsored 401(k) plan with company match
Access to LinkedIn Learning and other resources to support professional growth
Paid Family Leave, FSA, HSA, Commuter benefits, and Wellness benefits
