Site Reliability Engineer, Cloud Infrastructure

Quizlet•San Francisco, CA

29d•$120,000 - $168,488•Onsite

About The Position

At Quizlet, our mission is to help every learner achieve their outcomes in the most effective and delightful way. Our $1B+ learning platform serves tens of millions of students every month, including two-thirds of U.S. high schoolers and half of U.S. college students, powering over 2 billion learning interactions monthly. We blend cognitive science with machine learning to personalize and enhance the learning experience for students, professionals, and lifelong learners alike. We're energized by the potential to power more learners through multiple approaches and various tools.

Let's Build the Future of Learning

Join us to design and deliver AI-powered learning tools that scale across the world and unlock human potential.

About the Role

We are looking for a Site Reliability Engineer (SRE) to join our infrastructure team and help build reliable, scalable, and efficient systems. As an SRE, you'll blend software engineering expertise with systems knowledge to improve uptime, enhance performance, and reduce operational toil.

We're happy to share that this is an onsite position in our San Francisco office. To help foster team collaboration, we require that employees be in the office a minimum of three days per week (Monday, Wednesday, and Thursday) and as needed by your manager or the company. We believe that this working environment increases work efficiency, strengthens team partnership, and supports growth as an employee and organization.

Requirements

2+ years of professional experience in SRE, DevOps, Platform Engineering, or related infrastructure roles
Previous internship or professional experience writing code in a software development role (backend, full-stack, or similar)
Solid programming skills in languages such as Python, Go, or PHP
Familiarity with CI/CD systems and infrastructure-as-code tools (e.g., Terraform, GitHub Actions)
Understanding of Linux systems, networking fundamentals, and cloud-native concepts
A growth mindset with interest in continuous improvement, root cause analysis, and reducing operational burden
Good communication and collaboration skills; comfortable working across teams

Nice To Haves

Exposure to Kubernetes, container orchestration, or service mesh technologies
Experience with monitoring and alerting tools (e.g., Prometheus, Grafana, Datadog)

Responsibilities

Monitor and maintain the reliability and uptime of our systems and services through effective alerting, incident response, and resilient design patterns
Write automation scripts and tools for deployments, infrastructure management, and operational tasks to reduce manual effort (toil)
Contribute to observability improvements (metrics, logging, tracing) to enhance visibility into systems and applications
Work with product and engineering teams to ensure systems are designed with scalability and resilience in mind
Participate in post-incident reviews and implement action items to prevent recurrence
Support and optimize infrastructure (e.g., Kubernetes, cloud platforms like GCP/AWS)
Learn and apply SRE best practices, including SLOs, SLIs, and error budgets

Benefits

Collaborate with your manager and team to create a healthy work-life balance
20 vacation days that we expect you to take!
Competitive health, dental, and vision insurance (100% employee and 75% dependent PPO, Dental, VSP Choice)
Employer-sponsored 401k plan with company match
Access to LinkedIn Learning and other resources to support professional growth
Paid Family Leave, FSA, HSA, Commuter benefits, and Wellness benefits
40 hours of annual paid time off to participate in volunteer programs of choice
