Senior Site Reliability Engineer – Platform

Quizlet
San Francisco, CA
Posted 14 days ago
Onsite

About The Position

At Quizlet, our mission is to help every learner achieve their outcomes in the most effective and delightful way. Our $1B+ learning platform serves tens of millions of students every month, including two-thirds of U.S. high schoolers and half of U.S. college students, powering over 2 billion learning interactions monthly. We blend cognitive science with machine learning to personalize and enhance the learning experience for students, professionals, and lifelong learners alike. We’re energized by the potential to power more learners through multiple approaches and various tools.

Let’s Build the Future of Learning

Join us to design and deliver AI-powered learning tools that scale across the world and unlock human potential. As a Senior Site Reliability Engineer, you’ll design and build the automation, observability, and systems architecture that enable Quizlet to scale reliably for the next generation of AI-powered learning. You’ll engineer software, tools, and processes that improve service performance, reduce operational toil, and ensure our platform meets strict SLOs for global learners.

We’re happy to share that this is an onsite position in our San Francisco office. To help foster team collaboration, we require that employees be in the office a minimum of three days per week (Monday, Wednesday, and Thursday), plus additional days as needed by your manager or the company. We believe that this working environment increases work efficiency, strengthens team partnership, and supports growth as an employee and organization.

Requirements

  • 5+ years of experience in SRE, infrastructure, or systems software engineering
  • Strong proficiency in Go and/or Python, with experience automating operational workflows
  • Deep understanding of Kubernetes (GKE), Istio, and distributed systems
  • Hands-on experience with CI/CD systems (GitHub Actions, CircleCI, ArgoCD) and Terraform
  • Expertise in Datadog, incident response, and observability-driven engineering
  • Solid foundation in Linux systems, networking, and GCP or similar cloud environments
  • Proven ability to improve reliability, reduce MTTR, and lead by example

Responsibilities

  • Develop and maintain automation and tooling that ensure 99.95% uptime under peak load
  • Build self-healing and auto-remediation systems for our Kubernetes clusters (GKE) and service mesh (Istio)
  • Optimize our CI/CD toolchain (GitHub Actions, CircleCI, ArgoCD) to improve deployment velocity and reliability
  • Design and deploy deep observability and diagnostics in Datadog, and lead post-incident reviews with Jeli
  • Conduct capacity planning and performance tuning across databases (Spanner, PlanetScale, BigQuery)
  • Collaborate with product engineering to define SLOs and ensure new services are resilient by design

Benefits

  • Collaborate with your manager and team to create a healthy work-life balance
  • 20 vacation days that we expect you to take!
  • Competitive health, dental, and vision insurance (100% employee and 75% dependent coverage for PPO, Dental, and VSP Choice)
  • Employer-sponsored 401(k) plan with company match
  • Access to LinkedIn Learning and other resources to support professional growth
  • Paid Family Leave, FSA, HSA, Commuter benefits, and Wellness benefits