About The Position

At Klaviyo, we value the unique backgrounds, experiences and perspectives each Klaviyo (we call ourselves Klaviyos) brings to our workplace each and every day. We believe everyone deserves a fair shot at success and appreciate the experiences each person brings beyond the traditional job requirements. If you’re a close but not exact match with the description, we hope you’ll still consider applying. Want to learn more about life at Klaviyo? Visit klaviyo.com/careers to see how we empower creators to own their own destiny.

At Klaviyo, Platform Engineering is what you get when you treat operating complex systems as a software engineering problem. Our Observability Platform group applies that philosophy to how we collect, store, and surface signals about the health of our products and infrastructure. We build and run the shared observability stack (metrics, logs, traces, alerting, and developer-facing tooling) that enables every product and platform team at Klaviyo to understand how their systems behave in production and to ship changes with confidence.

As a Senior Observability Platform Engineer, you will design, build, and operate the core observability services that power Klaviyo’s monitoring and incident response. You’ll partner closely with product engineering, other platform teams, and security to define how we instrument services, standardize telemetry, and make it easy for engineers to debug issues in a fast-growing, distributed environment.

Requirements

  • Strong software engineering experience in at least one modern language (e.g., Go, Python, Java) and comfort working in Linux-based production environments.
  • Hands-on experience designing and operating observability systems at scale (for example: Prometheus / Cortex / Thanos / Mimir, OpenTelemetry, Grafana, alerting pipelines, log aggregation systems, or distributed tracing backends).
  • A track record of improving reliability and performance of complex, distributed applications using telemetry and data-driven insights.
  • Experience with infrastructure-as-code and modern cloud-native tooling (e.g., Terraform, Kubernetes, service meshes, CI/CD systems).
  • Strong technical communication and collaboration skills—you’re comfortable partnering with many teams, writing clear documentation, and leading technical discussions.
  • A mindset that values simple, well‑understood systems, iterative improvement, and a bias toward empowering other engineers rather than being on the critical path for every change.

Responsibilities

  • Own observability platforms end-to-end – Design, implement, and operate scalable, highly available systems for metrics, logging, tracing, and alerting (e.g., Prometheus-compatible metrics, time‑series storage, log pipelines, distributed tracing backends).
  • Build opinionated developer experiences – Create libraries, dashboards, runbooks, and self-service tooling that make “doing the right thing” for observability the easiest path for Klaviyo engineers.
  • Set standards for telemetry – Define and evangelize best practices for instrumentation, SLOs, alerting, and incident readiness across services and teams.
  • Drive reliability through data – Use observability data to identify performance bottlenecks, reliability risks, and architectural improvements, and collaborate with teams to address them.
  • Automate everything – Treat infrastructure as code; build automation for provisioning, configuration, scaling, and upgrades of observability components.
  • Mentor and multiply – Partner with engineers across Klaviyo to level up skills in debugging distributed systems, designing effective alerts, and using observability tools to make better product and reliability decisions.
  • Utilize AI – You’ve already experimented with AI in work or personal projects, and you’re excited to dive in and learn fast. You’re hungry to responsibly explore new AI tools and workflows, finding ways to make your work smarter and more efficient.