Senior Site Reliability Engineer
Babylist · Posted: August 4, 2023 · Remote
About the position
As a Senior Site Reliability Engineer (SRE) at Babylist, you will be responsible for ensuring the stability, scalability, and reliability of the company's systems and services. Your role will involve working closely with all Babylist Engineering teams to support shared infrastructure and developer tools. With your expertise in site reliability engineering, AWS cloud infrastructure, and modern DevOps practices, you will play a crucial role in optimizing systems and driving continuous improvement. This position requires a strong background in maintaining highly available and scalable systems, experience with Ruby/Ruby on Rails, proficiency in Terraform, Docker, and Kubernetes, and a solid understanding of cloud-native systems design. Troubleshooting, debugging, and excellent communication skills are also essential for success in this role.
Responsibilities
- Ensure the stability, scalability, and reliability of Babylist's systems and services
- Collaborate with all Babylist Engineering teams to support shared infrastructure and developer tools
- Utilize expertise in site reliability engineering, AWS cloud infrastructure, and modern DevOps practices to optimize systems and drive continuous improvement
- Maintain highly available and scalable systems for high-traffic consumer-facing websites built with Ruby/Ruby on Rails
- Manage and build AWS infrastructure using Infrastructure as Code (IaC) practices, specifically with Terraform
- Ensure the reliability, performance, and security of AWS cloud-based infrastructure and services
- Contribute to the design, deployment, and management of containerized applications using Docker and Kubernetes
- Design cloud-native systems, including CDNs, load balancers, cloud networking, DNS, caching, and distributed systems
- Troubleshoot and debug issues across various environments
- Design and support CI systems such as CircleCI, Jenkins, or GitHub Actions
- Implement monitoring and alerting best practices using tools like Datadog, Cronitor, Sentry, and PagerDuty
- Collaborate effectively with cross-functional teams, demonstrating excellent verbal and written communication skills
- Manage and optimize AWS infrastructure, including EKS clusters and databases, to ensure performance and reliability
- Improve the speed and reliability of Continuous Integration (CI) systems to support efficient development and deployment processes
- Provide support to developers and other team members as needed
Requirements
- 6+ years of experience as a Site Reliability Engineer or similar role, demonstrating a strong background in maintaining highly available and scalable systems
- Experience supporting high-traffic consumer-facing websites built with Ruby/Ruby on Rails, understanding the unique challenges and considerations in maintaining such systems
- Proficiency with Terraform is a must, as you will be a member of the team responsible for managing and building our AWS infrastructure using Infrastructure as Code (IaC) practices
- You possess strong experience working with AWS cloud-based infrastructure and services, ensuring their reliability, performance, and security
- Proficiency with Docker and Kubernetes is essential, as you will contribute to the design, deployment, and management of containerized applications in our environment
- You have a solid understanding of cloud-native systems design, including CDNs, load balancers, cloud networking, DNS, caching, and distributed systems
- Troubleshooting and debugging are second nature to you, allowing you to quickly identify and resolve issues across various environments
- Experience designing and supporting CI systems such as CircleCI, Jenkins, or GitHub Actions
- You are familiar with monitoring and alerting best practices, utilizing tools like Datadog, Cronitor, Sentry, and PagerDuty to ensure proactive identification and resolution of issues
- You have excellent verbal and written communication skills, and the ability to collaborate effectively with cross-functional teams
Benefits
- Remote-first company with flexible work hours
- In-person company offsites for community and collaboration
- Competitive pay and opportunities for career advancement
- Company paid medical, dental, and vision insurance
- Generous paid parental leave policy
- 401k with company match
- Perks for physical, mental, and emotional health
- Parenting and childcare benefits
- Financial planning benefits
- Market-based approach to pay
- Equity, bonus, and benefits offered
- A team culture that encourages diversity and inclusion