Site Reliability Engineer

Klaviyo GMI
Boston, MA
$131,082 - $174,000
Hybrid

About The Position

At Klaviyo, we value the unique backgrounds, experiences and perspectives each Klaviyo (we call ourselves Klaviyos) brings to our workplace each and every day. We believe everyone deserves a fair shot at success and appreciate the experiences each person brings beyond the traditional job requirements. If you’re a close but not exact match with the description, we hope you’ll still consider applying.

Telecommuting permitted 2 days per week. Multiple positions. Full-time. EEO/fully supports affirmative action practices.

Requirements

  • Master’s degree in Software Engineering, Computer Engineering, Electrical Engineering, Telecommunication Networks, or a related field and 2 years of experience in an engineering occupation.
  • Building software on an engineering team
  • Operating and scaling complex distributed systems
  • Developing applications in Python, Bash and Shell
  • Advanced Python and Bash scripting for building automation tools and workflow orchestration
  • Linux (including Ubuntu) and all layers of the networking stack
  • Administering and debugging production Linux systems
  • Automating infrastructure for cloud environments using AWS CloudFormation and Terraform
  • AWS services for secure and scalable infrastructure management, including EC2, IAM, CloudWatch, CloudTrail, S3, or Lambda
  • Grafana, Prometheus, Filebeat, and Logstash
  • Designing, deploying, and managing containerized applications using Kubernetes and Docker
  • Collaborating with cross-functional teams including developers, operations, and security to deliver reliable and secure systems
  • Resolving complex system outages, driving failure analysis to root cause, and preventing future issues

Responsibilities

  • Design and develop systems and processes that enable highly available and scalable systems.
  • Design, build, and deliver software to dramatically improve the availability, scalability, latency, and efficiency of services.
  • Achieve breakthroughs in systems throughput by identifying and eliminating bottlenecks.
  • Champion best practices by actively collaborating with other teams in a culture that values technical design review.
  • Collaborate with other Engineers to build better software by focusing on performance, self-healing systems, configuration as code, defensive programming, and application security.
  • Participate in periodic on-call duties with a focus on resolving issues quickly once discovered, preventing recurrences, and minimizing alert fatigue.
  • Work closely with product-facing Engineers to ship impactful code.
  • Perform quantitative analysis to understand and scale systems and manage the cross-functional efforts to resolve scalability issues.
  • Produce and advocate for preventative, upstream solutions with internal stakeholders and external vendors and dependencies.
  • Support informed, data-driven decision-making in a fast-paced environment with competing priorities.
  • Promote Site Reliability best practices across the Engineering organization.

Benefits

  • Annual cash bonus plan
  • Comprehensive range of health, welfare, and wellbeing benefits