AI/ML Data Engineer

Marvell Technology•Santa Clara, CA

2d•$105,200 - $157,600

About The Position

Embedded within the AI/ML team, this role owns the data engineering layer that powers both Gen AI applications and ML model development. The engineer is responsible for building production-grade pipelines, curating AI-ready datasets for LLMs and ML models, and contributing to front-end interfaces when required, so that the team can deliver complete, data-driven AI products without external dependencies.

Requirements

Databricks — Spark, Delta Lake, Databricks Workflows, Unity Catalog; production-grade experience required
Snowflake — advanced SQL, data modeling, performance tuning, cost management
Python — strong engineering fundamentals; PySpark, pandas, pipeline frameworks (dbt, Airflow, or equivalent)
SQL — expert level; complex transformations, query optimization, schema design
Front-End Development — React, JavaScript/TypeScript, REST API integration, and Streamlit for rapid AI/ML application prototyping and internal tooling
Solid understanding of the ML lifecycle, including feature stores, training pipelines, and inference data patterns
Cloud-native experience on AWS, Azure, or GCP
Experience with data quality and observability tooling

Nice To Haves

Hands-on experience with MLflow, Feast, LangChain, or LlamaIndex
Exposure to graph databases (Neo4j, Neptune, or equivalent)
Exposure to vector databases (Pinecone, Weaviate, pgvector, or equivalent)
Experience with streaming pipelines (Kafka, Kinesis, Spark Structured Streaming)
Familiarity with LLM evaluation frameworks and dataset benchmarking

Responsibilities

Architect and deliver production-grade ELT/ETL pipelines across Databricks and Snowflake for ML training, validation, and inference workflows
Build and maintain AI-ready datasets optimized for both ML model consumption and Gen AI use cases — clean, versioned, and reproducible
Curate and structure high-quality datasets for RAG pipelines and embedding generation; design document chunking strategies, metadata schemas, and grounding data layers that directly improve retrieval accuracy and Gen AI application performance
Implement data quality frameworks and data contracts at pipeline boundaries to protect model and application integrity
Build and manage vector-ready data assets, integrating with vector stores and embedding infrastructure for Gen AI applications
Establish DataOps best practices — CI/CD for pipelines, data lineage, versioning, and cost observability across platforms
Develop Streamlit applications and React-based UIs to surface model outputs, data products, and internal AI tooling
Partner with ML Engineers, Data Scientists, and AI Engineers to translate modeling and application requirements into reliable data products
Contribute to lakehouse architecture decisions, storage optimization, and compute efficiency across the AI/ML data platform