Site Reliability Engineer

Klaviyo GMI•Boston, MA

Hybrid

About The Position

At Klaviyo, we value the unique backgrounds, experiences and perspectives each Klaviyo (we call ourselves Klaviyos) brings to our workplace each and every day. We believe everyone deserves a fair shot at success and appreciate the experiences each person brings beyond the traditional job requirements. If you’re a close but not exact match with the description, we hope you’ll still consider applying.

Telecommuting permitted 2 days per week. Multiple positions. Full-time. EEO/fully supports affirmative action practices.

Requirements

Master’s degree in Software Engineering, Computer Engineering, Electrical Engineering, Telecommunication Networks, or a related field and 2 years of experience in an engineering occupation.
24 months of experience in building software on an engineering team
24 months of experience in operating and scaling complex distributed systems
24 months of experience in developing applications in Python, Bash and shell
24 months of experience in advanced Python and Bash scripting for building automation tools and workflow orchestration
24 months of experience with Linux (including Ubuntu) and all layers of the networking stack
24 months of experience in administering and debugging production Linux systems
24 months of experience in automating infrastructure for cloud environments using AWS CloudFormation and Terraform
24 months of experience with AWS services for secure and scalable infrastructure management, including EC2, IAM, CloudWatch, CloudTrail, S3, or Lambda
24 months of experience with Grafana, Prometheus, Filebeat and Logstash
24 months of experience in designing, deploying, and managing containerized applications using Kubernetes and Docker
24 months of experience in collaborating with cross-functional teams including developers, operations, and security to deliver reliable and secure systems
24 months of experience in resolving complex systems outages, driving failures through root cause analysis, and preventing future issues

Responsibilities

Design and develop systems and processes that enable highly available and scalable systems.
Design, build, and deliver software to dramatically improve the availability, scalability, latency, and efficiency of services.
Achieve breakthroughs in systems throughput by identifying and eliminating bottlenecks.
Champion best practices by actively collaborating with other teams in a culture that values technical design review.
Collaborate with other Engineers to build better software by focusing on performance, self-healing systems, configuration as code, defensive programming, and application security.
Participate in periodic on-call duties with a focus on resolving issues quickly once discovered, preventing recurrences, and minimizing alert fatigue.
Work closely with product-facing Engineers to ship impactful code.
Perform quantitative analysis to understand and scale systems and manage the cross-functional efforts to resolve scalability issues.
Produce and advocate for preventative, upstream solutions with internal stakeholders and external vendors and dependencies.
Support informed, data-driven decision-making in a fast-paced environment with competing priorities.
Promote Site Reliability best practices across the Engineering organization.