Site Reliability Engineer

Klaviyo GMI•Boston, MA

$131,082 - $174,000•Hybrid

About The Position

At Klaviyo, we value the unique backgrounds, experiences and perspectives each Klaviyo (we call ourselves Klaviyos) brings to our workplace each and every day. We believe everyone deserves a fair shot at success and appreciate the experiences each person brings beyond the traditional job requirements. If you’re a close but not exact match with the description, we hope you’ll still consider applying.

Telecommuting permitted 2 days per week. Multiple positions. Full-time. EEO/fully supports affirmative action practices.

Requirements

Master’s degree in Software Engineering, Computer Engineering, Electrical Engineering, Telecommunication Networks, or a related field and 2 years of experience in an engineering occupation.
Building software on an engineering team
Operating and scaling complex distributed systems
Developing applications in Python, Bash and Shell
Advanced Python and Bash scripting for building automation tools and workflow orchestration
Linux (including Ubuntu) and all layers of the networking stack
Administering and debugging production Linux systems
Automating infrastructure for cloud environments using AWS CloudFormation and Terraform
AWS services for secure and scalable infrastructure management, including EC2, IAM, CloudWatch, CloudTrail, S3, or Lambda
Grafana, Prometheus, Filebeat, and Logstash
Designing, deploying, and managing containerized applications using Kubernetes and Docker
Collaborating with cross-functional teams including developers, operations, and security to deliver reliable and secure systems
Resolving complex systems outages, driving failures through root cause analysis, and preventing future issues

Responsibilities

Design and develop systems and processes that enable highly available and scalable systems.
Design, build, and deliver software to dramatically improve the availability, scalability, latency, and efficiency of services.
Achieve breakthroughs in systems throughput by identifying and eliminating bottlenecks.
Champion best practices by actively collaborating with other teams in a culture that values technical design review.
Collaborate with other Engineers to build better software by focusing on performance, self-healing systems, configuration as code, defensive programming, and application security.
Participate in periodic on-call duties with a focus on resolving issues quickly once discovered, preventing recurrences, and minimizing alert fatigue.
Work closely with product-facing Engineers to ship impactful code.
Perform quantitative analysis to understand and scale systems and manage the cross-functional efforts to resolve scalability issues.
Produce and advocate for preventative, upstream solutions with internal stakeholders and external vendors and dependencies.
Support informed, data-driven decision-making in a fast-paced environment with competing priorities.
Promote Site Reliability best practices across the Engineering organization.