Site Reliability Engineer

Klaviyo GMI
Boston, MA
Hybrid

About The Position

At Klaviyo, we value the unique backgrounds, experiences and perspectives each Klaviyo (we call ourselves Klaviyos) brings to our workplace each and every day. We believe everyone deserves a fair shot at success and appreciate the experiences each person brings beyond the traditional job requirements. If you’re a close but not exact match with the description, we hope you’ll still consider applying.

Telecommuting permitted 2 days per week. Multiple positions. Full-time. EEO/fully supports affirmative action practices.

Requirements

  • Master’s degree in Software Engineering, Computer Engineering, Electrical Engineering, Telecommunication Networks, or a related field and 2 years of experience in an engineering occupation.
  • 24 months of experience in building software on an engineering team
  • 24 months of experience in operating and scaling complex distributed systems
  • 24 months of experience in developing applications in Python, Bash and Shell
  • 24 months of experience in advanced Python and Bash scripting for building automation tools and workflow orchestration
  • 24 months of experience in Linux (including Ubuntu) and all layers of the networking stack
  • 24 months of experience in administering and debugging production Linux systems
  • 24 months of experience in automating infrastructure for cloud environments using AWS CloudFormation and Terraform
  • 24 months of experience in AWS services for secure and scalable infrastructure management, including EC2, IAM, CloudWatch, CloudTrail, S3, or Lambda
  • 24 months of experience in Grafana, Prometheus, Filebeat and Logstash
  • 24 months of experience in designing, deploying, and managing containerized applications using Kubernetes and Docker
  • 24 months of experience in collaborating with cross-functional teams including developers, operations, and security to deliver reliable and secure systems
  • 24 months of experience in resolving complex systems outages, driving root cause analysis of failures, and preventing future issues

Responsibilities

  • Design and develop systems and processes that enable highly available and scalable systems.
  • Design, build, and deliver software to dramatically improve the availability, scalability, latency, and efficiency of services.
  • Achieve breakthroughs in systems throughput by identifying and eliminating bottlenecks.
  • Champion best practices by actively collaborating with other teams in a culture that values technical design review.
  • Collaborate with other Engineers to build better software by focusing on performance, self-healing systems, configuration as code, defensive programming, and application security.
  • Participate in periodic on-call duties with a focus on resolving issues quickly once discovered, preventing recurrences, and minimizing alert fatigue.
  • Work closely with product-facing Engineers to ship impactful code.
  • Perform quantitative analysis to understand and scale systems and manage the cross-functional efforts to resolve scalability issues.
  • Produce and advocate for preventative, upstream solutions with internal stakeholders and external vendors and dependencies.
  • Support informed, data-driven decision-making in a fast-paced environment with competing priorities.
  • Promote Site Reliability best practices across the Engineering organization.

Benefits

  • Annual cash bonus plan
  • Variable compensation (OTE) for sales and customer success roles
  • Equity
  • Sign-on payments
  • Comprehensive range of health, welfare, and wellbeing benefits