Principal Software Engineer, Enterprise Scalability

Klaviyo•Boston, MA

Remote

About The Position

At Klaviyo, we value the unique backgrounds, experiences, and perspectives each Klaviyo (we call ourselves Klaviyos) brings to our workplace each and every day. We believe everyone deserves a fair shot at success and appreciate the experiences each person brings beyond the traditional job requirements. If you’re a close but not exact match with the description, we hope you’ll still consider applying. Want to learn more about life at Klaviyo? Visit klaviyo.com/careers to see how we empower creators to own their own destiny.

As Klaviyo’s senior IC for scale, you will report to a VP of Engineering and lead performance, reliability, multi‑region, and large‑tenant readiness. You’ll drive platform-wide architectural change, hunt bottlenecks and optimize systems, and partner across teams to productionize improvements. This is an IC role with no direct reports; you will lead through technical depth, hands‑on impact, and crisp cross‑org alignment.

Requirements

12+ years scaling multi‑tenant SaaS with a reputation for removing major bottlenecks and proving impact with data.
Deep experience in performance engineering, capacity planning, sharding/partitioning, caching/back‑pressure, multi‑region readiness, and high‑volume migrations; you turn hotspots into robust patterns.
You apply AI to scale work—profiling assistance, workload modeling, synthetic traffic generation, anomaly detection, and runbook copilots—always with explicit guardrails and observability.
You align teams through fitness functions, scorecards, and readiness gates that accelerate—not block—delivery; you communicate tradeoffs crisply to execs and engineers.
Curious, adaptable, and proactive in exploring AI that responsibly improves scale outcomes.

Nice To Haves

Company-wide fitness functions (latency/throughput/error rates) are adopted and reviewed regularly.
2–3 bottlenecks removed with documented, reproducible testbeds; pXX latencies and error rates improve on top enterprise workloads; repeat P0s trend down.
AI-driven anomaly detection reduces alert noise while improving signal; generative load testing and copilot runbooks are used in release/readiness checks for the top critical services; time‑to‑isolate regressions drops 20–30%.

Responsibilities

Define enterprise scalability fitness functions (latency/throughput/error rates) and a scorecard; align teams to SLOs and budgets.
Design/implement sharding and partitioning strategies, caching/back‑pressure, multi‑region readiness, and high‑volume migration paths.
Build lightweight enablement: benchmarks, profiling harnesses, reproducible testbeds; pair with teams to land fixes.
Lead scalability reviews and readiness gates that accelerate—not block—delivery; drive incident deep dives tied to systemic fixes.
Communicate clearly to execs and engineers, tying technical work to business impact and customer outcomes.
Integrate AI into scale and resiliency work—from proactive anomaly detection to synthetic load and guided runbooks—so performance improvements stick and incidents don’t repeat.