Senior AI Data Platform Engineer

Adobe
San Jose, CA (Remote)

About The Position

The Adobe Express Data Platform is a critical system serving millions of creators and processing billions of events daily. It supports streaming, feature serving, agent data APIs, and a lakehouse that powers personalization, experimentation, and AI workflows. The team is evolving it into a streaming-first, self-healing, agent-ready lakehouse.

This role is systems-first: rather than building ML models, the engineer will build foundational infrastructure for AI, analytics, and autonomous agents at scale, and will automate manual, repetitive, or slow platform workflows using an agentic-first approach. Key challenges include reducing pipeline latency to real-time, building MCP-compatible agent data APIs for AI systems, enhancing the ML Attribute Store with low-latency online serving, and pioneering AI-powered data governance for self-healing pipelines.

The team's motto is to make the platform simpler, faster, and more reliable, emphasizing a disciplined approach to shipping quickly.

Requirements

  • 6+ years of experience in data platform engineering, distributed systems, or backend infrastructure at scale.
  • Deep hands-on experience with Apache Spark, Databricks, Delta Lake, or equivalent lakehouse technologies (Iceberg, Hudi).
  • Proven track record building and operating large-scale pipelines processing billions of events daily with sub-hour latency SLAs.
  • Strong experience with streaming systems: Kafka, Kinesis, Flink, Spark Structured Streaming, or Delta Live Tables.
  • Proficiency in Python and/or Scala; SQL fluency required. Java or Go is a plus.
  • Experience with cloud platforms (AWS or Azure), containerization (Docker, Kubernetes), and CI/CD for data pipelines.
  • Production experience integrating LLMs into engineering workflows — not prototypes, but systems running against real data with real users. Includes prompt engineering, tool-use/function-calling, structured output parsing, and context window management.
  • Hands-on experience with agentic AI frameworks and multi-agent orchestration (LangChain, LangGraph, CrewAI, AutoGen, or custom agent loops with memory, planning, and tool routing).
  • Understanding of MCP (Model Context Protocol) and/or A2A protocols for exposing platform capabilities as agent-consumable tool servers — or demonstrable ability to build equivalent agent-tool integration surfaces.
  • Experience building or operating ML Feature Stores (online and/or offline), including training-serving skew mitigation, feature freshness trade-offs, and real-time feature computation.
  • Familiarity with RAG architectures: embedding generation, vector databases (FAISS, Pinecone, Weaviate, Databricks Vector Search), document chunking strategies, and retrieval evaluation.
  • Exposure to semantic layers, knowledge graphs, or metadata-driven data discovery systems (Unity Catalog, DataHub, OpenMetadata) that enable agents to autonomously navigate enterprise data catalogs.
  • Ability to build evaluation and feedback pipelines for AI systems — measuring agent accuracy, latency, cost attribution per workflow, and reliability at scale.
  • Demonstrated use of AI-powered developer tools (Claude Code, Cursor, GitHub Copilot, or similar) to accelerate engineering velocity.
  • Agentic-first instinct: you default to “can an agent do this?” before reaching for manual solutions, scripts, or traditional automation. You see every repetitive workflow as a target for autonomous replacement.
  • Challenger mentality: you question inherited architecture, push back on “we’ve always done it this way,” and drive fast improvement through first-principles thinking. You treat the status quo as technical debt.
  • Extreme bias for action and time-to-market: you ship iteratively, prefer “good enough now” over “perfect later,” and unblock yourself. You measure success in production impact, not design docs.
  • Systems thinker who traces dependencies, considers second-order effects, and asks “why did this break?” not just “how do I fix it?”
  • End-to-end ownership from design through production to 2 AM incident response. Platform reliability is personal.

Nice To Haves

  • Experience building AI-powered developer tools, self-serve data platforms, or code generation agents that reduce engineering toil.
  • Experience migrating batch-first data architectures to streaming-first without disrupting downstream consumers, including dual-write patterns, shadow pipelines, and incremental cutover strategies.
  • Experience building autonomous monitoring systems that detect, diagnose, and remediate pipeline failures without human intervention: circuit breakers, auto-rollback, and intelligent retry logic.
  • Familiarity with Adobe-native data and analytics solutions (CJA, AEP, Adobe Analytics) and data governance automation including FinOps practices, cost attribution, and compliance frameworks.
  • Contributions to open-source data or AI infrastructure projects, published engineering blog posts, or conference talks.
  • BS/MS in Computer Science, Engineering, or equivalent practical experience.

Responsibilities

  • Design and build streaming-first data pipelines that collapse end-to-end latency from hours to minutes through event-driven architectures.
  • Own and extend the ML Attribute Store — building low-latency online serving capabilities alongside batch feature computation with unified batch/streaming aggregation to prevent training-serving skew.
  • Build MCP-compatible Agent Data APIs and tool servers that make the lakehouse discoverable and queryable by autonomous AI agents through standardized protocols, semantic layers, and catalog-driven data discovery.
  • Develop an agentic framework for platform operations: automated anomaly detection, duplicate event cleanup, transient event lifecycle management with audit trails, pipeline self-healing, and automated root cause analysis.
  • Drive operational excellence: observability, incident detection and response automation, performance tuning, cost optimization, and on-call ownership for mission-critical platform services.
  • Collaborate across Data Science, Personalization, Engineering Operations, Product, and Experimentation teams to translate platform capabilities into self-serve infrastructure that reduces engineering toil for non-platform teams.
  • Use and champion AI-powered developer tools (Claude Code, Cursor, GitHub Copilot, or similar) to accelerate personal and team engineering velocity.

Benefits

  • Comprehensive benefits programs


What This Job Offers

  • Job Type: Full-time
  • Career Level: Senior
  • Education Level: No Education Listed
  • Number of Employees: 5,001-10,000 employees
