Principal Engineer, Enterprise Scalability

Klaviyo • Boston, MA

About The Position

As Klaviyo’s senior IC for scale, you will report to a VP of Engineering and lead performance, reliability, multi‑region, and large‑tenant readiness. You’ll drive platform-wide architectural change, hunt bottlenecks and optimize systems, and partner across teams to productionize improvements. This is an IC role with no direct reports; you will lead through technical depth, hands‑on impact, and crisp cross‑org alignment.

Requirements

Experience: 12+ years scaling multi‑tenant SaaS with a reputation for removing major bottlenecks and proving impact with data.
Technical expertise: Performance engineering, capacity planning, sharding/partitioning, caching/back‑pressure, multi‑region readiness, and high‑volume migrations; you turn hotspots into robust patterns.
AI tools & automation: You apply AI to scale work—profiling assistance, workload modeling, synthetic traffic generation, anomaly detection, and runbook copilots—always with explicit guardrails and observability.
Cross‑org influence: You align teams through fitness functions, scorecards, and readiness gates that accelerate—not block—delivery; you communicate tradeoffs crisply to execs and engineers.
AI fluency: Curious, adaptable, and proactive in exploring AI that responsibly improves scale outcomes.

Nice To Haves

Scale scorecard: Company‑wide fitness functions (latency/throughput/error rates) are adopted and reviewed regularly.
High‑impact wins: 2–3 bottlenecks removed with documented, reproducible testbeds; pXX latencies and error rates improve on top enterprise workloads; repeat P0s trend down.
AI‑assisted scale engineering: AI‑driven anomaly detection reduces alert noise while improving signal; generative load testing and copilot runbooks are used in release/readiness checks for the top critical services; time‑to‑isolate regressions drops 20–30%.

Success in 6–12 Months

Company‑wide scale scorecard in place; 2–3 high‑impact bottlenecks removed; top enterprise workloads show improved pXX latencies and error rates; fewer repeat P0s.

Responsibilities

Define enterprise scalability fitness functions (latency/throughput/error rates) and a scorecard; align teams to SLOs and budgets.
Design/implement sharding and partitioning strategies, caching/back‑pressure, multi‑region readiness, and high‑volume migration paths.
Build lightweight enablement: benchmarks, profiling harnesses, reproducible testbeds; pair with teams to land fixes.
Lead scalability reviews and readiness gates that accelerate—not block—delivery; drive incident deep dives tied to systemic fixes.
Communicate clearly to execs and engineers, tying technical work to business impact and customer outcomes.
Integrate AI into scale and resiliency work—from proactive anomaly detection to synthetic load and guided runbooks—so performance improvements stick and incidents don’t repeat.
