AI Data Engineer

Apple
Cupertino, CA

About The Position

The Applied Data Science team within Legal Operations is building production-grade AI for a global legal organization. The AI Data Engineer owns the pipelines, data feeds, and integration infrastructure that ensure AI applications have the right data, in the right form, at the right time. This role is embedded within the AI team and works in close partnership with AI and data colleagues to ensure AI systems have reliable, high-quality data at every stage.

Requirements

  • Bachelor's degree in Computer Science, Data Science, Information Systems, or related field (or equivalent experience)
  • 4+ years of experience in data engineering for AI applications
  • Strong proficiency in SQL and Python for data engineering and transformation
  • Experience with cloud data platforms (Snowflake, Databricks, BigQuery, or similar)
  • Experience with ETL/ELT tools (dbt, Fivetran, Airflow, or similar)
  • Experience building and maintaining REST APIs
  • Understanding of data modeling and data transformation best practices
  • Experience with version control (Git) and CI/CD practices
  • Ability to work closely with AI/ML teams and understand their data requirements

Nice To Haves

  • Master's degree
  • Experience with vector databases (Pinecone, Weaviate, Chroma), embedding generation pipelines, document stores (MongoDB or similar) and their integration patterns
  • Understanding of RAG, MCP architectures, context engineering principles, and how data quality affects retrieval performance
  • Experience with semantic layer technologies (dbt Semantic Layer, Cube, AtScale), knowledge graphs (Neo4j), or ontology design
  • Experience with streaming or event-driven data architectures (Kafka or similar)
  • Familiarity with legal operations data (matter management, eBilling, CLM, document management)

Responsibilities

  • Design and implement data pipelines that ingest, transform, and deliver data from legal systems (matter management, eBilling, CLM, document management) to AI applications
  • Build and maintain pipelines that load and refresh vector databases, document stores, and graph databases used by AI retrieval systems
  • Engineer data transformations that prepare legal data for AI consumption — chunking, embedding generation, metadata enrichment, and schema normalization
  • Build upstream and downstream integrations with MCP (Model Context Protocol), vector databases, and knowledge graphs to support context engineering and AI retrieval systems
  • Develop and maintain APIs that expose structured and unstructured data to AI applications and analytics tools
  • Implement data quality checks and validation at pipeline ingestion points to ensure AI systems receive reliable, complete data
  • Build monitoring and alerting for pipeline health, data freshness, and load failures
  • Understand AI data access patterns and optimize data delivery for AI performance
  • Integrate with the semantic layer — consuming entity resolution outputs, taxonomy mappings, and enriched datasets to ground AI applications
  • Implement ETL/ELT processes using dbt, Fivetran, or similar tools with a focus on reliability and maintainability
  • Document pipeline designs, data contracts, and operational runbooks
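To illustrate the kind of work described above, the following is a minimal, self-contained Python sketch of a chunk → embed → validate flow for document ingestion. It is not part of the posting: the function names (`chunk_text`, `embed`, `quality_check`) are hypothetical, and the hash-based "embedding" is a stand-in for a real embedding model call.

```python
# Illustrative sketch only. In production, embed() would call an
# embedding model, and valid chunks would be upserted to a vector store.
import hashlib
from dataclasses import dataclass, field

@dataclass
class Chunk:
    doc_id: str
    text: str
    metadata: dict = field(default_factory=dict)
    embedding: list = field(default_factory=list)

def chunk_text(doc_id: str, text: str, size: int = 200, overlap: int = 50) -> list:
    """Split a document into overlapping character windows."""
    step = size - overlap
    return [Chunk(doc_id, text[i:i + size])
            for i in range(0, max(len(text) - overlap, 1), step)]

def embed(chunk: Chunk, dim: int = 8) -> Chunk:
    """Toy embedding: hash bytes scaled to [0, 1). A real pipeline
    would call an embedding model here instead."""
    digest = hashlib.sha256(chunk.text.encode()).digest()
    chunk.embedding = [b / 255 for b in digest[:dim]]
    return chunk

def quality_check(chunk: Chunk, dim: int = 8) -> bool:
    """Validate a chunk before loading it into the vector store:
    non-empty text and an embedding of the expected dimension."""
    return bool(chunk.text.strip()) and len(chunk.embedding) == dim

doc = "Matter 2024-001: outside counsel engagement terms. " * 10
chunks = [embed(c) for c in chunk_text("matter-2024-001", doc)]
valid = [c for c in chunks if quality_check(c)]
print(f"{len(valid)}/{len(chunks)} chunks passed validation")
```

In a real pipeline the validation step would gate the load: chunks failing the check are quarantined and alerted on, rather than silently degrading retrieval quality.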