Senior Site Reliability Engineer

Duolingo • Pittsburgh, PA

About The Position

Our mission at Duolingo is to develop the best education in the world and make it universally available. It’s a big mission, and that’s where you come in!

At Duolingo, you’ll join a team that cares about finding innovative solutions to complex technical problems, running countless experiments (300+ at a time!) with our massive user base to make data-driven decisions, and educating our users and employees alike. You’ll have limitless learning opportunities, mentorship and collaboration with world-class minds, and a variety of projects with large scopes, while doing work that’s both fun and meaningful. Join our life-changing mission to develop education for our half a billion (and growing!) learners around the world.

About The Role

As a Senior Site Reliability Engineer, you will work closely with both product and platform engineering teams to ensure Duolingo’s sophisticated distributed systems and products are built and maintained with extraordinary quality, and operated in measurable and scalable ways.

Requirements

3+ years of experience in site reliability engineering/DevOps for a product with millions of users
Experience identifying and solving issues in large-scale distributed systems
Experience with Java, Kotlin, Python or Go
Proficiency in networking protocols such as TCP/IP, HTTP, SSL, and DNS
An understanding of containerization toolsets and container orchestration technologies (Docker, Mesos, Kubernetes, Nomad, etc.)

Nice To Haves

Experience in improving automation and tooling to reduce service maintenance toil
Proven experience driving improvements to incident response processes
Experience assessing reliability and troubleshooting issues in MySQL and/or PostgreSQL databases

Responsibilities

Collaborate with internal teams to identify sources of instability in distributed systems and drive operational excellence
Own core infrastructure (i.e., understand, diagnose, and debug these systems in production)
Provide system design consulting, develop software platforms/frameworks, and conduct launch reviews and root cause analysis
Maintain and document sustainable postmortem/incident response practices
Advocate for and implement changes that improve reliability, scalability, and velocity
Reduce the burden of toil with iterative development of tooling and automation
Collaborate with engineering teams to release new features and become an authority on our services
