Principal Engineer, Data Infrastructure

Klaviyo•Boston, MA

$244,000 - $366,000•Hybrid

About The Position

Own the technical direction and delivery for Klaviyo's data platform - streaming, batch compute, storage/lakehouse, and governance - now the foundation for autonomous, agent-driven experiences. Klaviyo's data platform processes billions of events daily across billions of consumer profiles for hundreds of thousands of brands, and increasingly that data is consumed not just by people but by a growing number of AI agents acting on customers' behalf. The footprint grows on two axes at once: ever‑larger data volumes and a steady stream of new data scenarios (new sources, domains, and consumption patterns), and your systems have to scale on both. You'll design and ship systems that are fast, reliable, cost‑efficient, and safe for autonomous access, creating paved roads for data producers and consumers, human and agent alike. This is an individual‑contributor role (no direct reports); you lead through architecture, code, and influence. Expectations align to Lead/Principal IC behaviors: establishing SLOs, driving technical evolution, and acting as the interface across teams.

Requirements

10+ years building and operating distributed data systems (e.g., Kafka/PubSub, Flink/Spark/Beam, Airflow/Dagster, Iceberg/Delta/Hudi, Snowflake/BigQuery, object storage) with multi‑tenant reliability.
Technical expertise: Data ingestion/CDC, stream processing, batch orchestration, lakehouse patterns, catalog/lineage, governance, and access controls, measured by freshness, availability, and correctness SLOs.
AI tools & automation: You apply ML/GenAI to data platforms - semantic catalog search, auto‑docs, data‑quality anomaly detection, SQL/pipeline generation - with human‑in‑the‑loop review and privacy controls.
Influence & enablement: You land data contracts, connectors, and templates that speed delivery for producers and consumers; you mentor via design docs and pairing.
AI fluency (Klaviyo default): You experiment, learn fast, and share AI wins responsibly.

Nice To Haves

Regional isolation/replication strategies, privacy‑by‑design, and data governance in regulated contexts.
Adopted paved roads: Producers/consumers are on standard ingestion, processing, and storage paths; schema governance and contracts reduce breakage.
SLOs & efficiency: ≥99.9% freshness for key domains; measurable cost/TB reductions; production debugging is faster with defined readiness reviews and incident learning.
AI‑augmented data operations: Semantic discovery and auto‑documentation cover the majority of high‑value datasets; AI‑assisted DQ monitors reduce data incidents on top pipelines by 25–40%; pipeline authoring and review times drop 15–25% with AI in the loop.

Responsibilities

Design and implement core data platform capabilities (e.g., event ingestion/CDC, stream processing, batch orchestration, data lake/warehouse patterns, catalog/lineage, governance, access, and compliance).
Define and uphold SLOs for data freshness, availability, and correctness; author/run readiness reviews, incident response, and post‑incident learning for your domain.
Author ADRs/RFCs, land data contracts and schema governance, and standardize connectors and templates that accelerate developer velocity.
Profile, tune, and right‑size systems for performance and cost; partner with FinOps on unit‑economics guardrails.
Pair with product teams and analytics/ML to expose the right abstractions and unblock customer value quickly.
Contribute high‑quality code and reviews; mentor Staff/Sr. engineers across pillars through example and enablement (not line management).
Use AI to streamline data workflows, from authoring and testing pipelines to catalog/search and DQ, so analysts, ML, and product teams move faster with confidence.