Principal Engineer, Core Infrastructure

Klaviyo • Boston, MA

About The Position

As a hands‑on principal for compute, networking, storage, runtimes (e.g., Kubernetes), CI/CD, and observability, you’ll architect the service platform that lets teams ship fast and safely. This is an IC role with no direct reports: you lead via design, code, and incident excellence, setting technical standards and SLOs for platform services.

Success in 6–12 Months

≥99.95% SLOs for core services
25–50% faster build/deploy times
Reduced developer‑reported friction
Incident recurrence trending down

We use Covey as part of our hiring and/or promotional process. For jobs or candidates in NYC, certain features may qualify it as an AEDT. As part of the evaluation process, we provide Covey with job requirements and candidate-submitted applications. We began using Covey Scout for Inbound on April 3, 2025. Please see the independent bias audit report covering our use of Covey here.

Requirements

10+ years building and operating cloud platforms (compute, networking, storage, runtimes like Kubernetes), with a track record of multi‑region HA and SLO rigor
Deep in Kubernetes, service mesh, Terraform/IaC, CI/CD, and production observability; you ship golden paths and guardrails that lift the whole org
Experience with databases and storage systems, including SQL and NoSQL databases, and object, block, or file storage platforms
You’ve brought AI into platform engineering—from copilot‑assisted workflows and intelligent test generation to AIOps for incident triage, anomaly detection, and runbook automation—with clear security and cost boundaries
You lead via design reviews, incident excellence, and SLO/error‑budget tradeoffs communicated in business terms
You’re hands‑on with AI tools and help teams adopt them responsibly

Nice To Haves

≥99.95% SLOs for core services
25–50% faster build/deploy times
Developer‑reported friction trending down
Approved AI tooling is integrated into IDE/CI/CD with repo policies and auditability
≥70% MAU among eligible engineers
MTTR down 20–30% via AI‑assisted triage
Flaky‑test rate decreases through targeted, AI‑suggested fixes
Cost, security, and compliance controls are codified as IaC modules and enforced in paved roads
Experience with enterprise governance, including compliance and audit requirements
Familiarity with GDPR and data privacy considerations in large-scale, production environments

Responsibilities

Architect and evolve the Kubernetes platform, service mesh, networking, storage, and CI/CD pipelines
Ship golden paths and IaC modules
Define platform SLOs
Use error budgets to guide reliability vs. velocity trade‑offs
Drive incident learning and readiness reviews
Improve developer velocity (build/deploy times, flaky tests, local dev ergonomics) with measurable results
Lead capacity planning and commitments
Build guardrails for cost, security, and compliance with Security/FinOps partners
Write high‑impact code, automation, and tooling
Mentor across teams and raise the bar on operational excellence
Embed AI in the developer experience—from code generation to observability and incident response—so teams ship faster and safer by default