AI/ML Data Engineer

Marvell Technology • Santa Clara, CA

About The Position

About Marvell

Marvell’s semiconductor solutions are the essential building blocks of the data infrastructure that connects our world. Across enterprise, cloud and AI, and carrier architectures, our innovative technology is enabling new possibilities. At Marvell, you can affect the arc of individual lives, lift the trajectory of entire industries, and fuel the transformative potential of tomorrow. For those looking to make their mark on purposeful and enduring innovation, above and beyond fleeting trends, Marvell is a place to thrive, learn, and lead.

Your Team, Your Impact

Embedded within the AI/ML team, this role owns the data engineering layer that powers both Gen AI applications and ML model development. The role is responsible for building production-grade pipelines, curating AI-ready datasets for LLMs and ML models, and contributing to front-end interfaces when required, so that the team can deliver complete, data-driven AI products without external dependency.

Expected Base Pay Range (USD)

$142,400 - $213,300 per annum

The successful candidate’s starting base pay will be determined based on job-related skills, experience, qualifications, work location and market conditions. The expected base pay range for this role may be modified based on market conditions.

Additional Compensation and Benefit Elements

Marvell is committed to providing exceptional, comprehensive benefits that support our employees at every stage, from internship to retirement and through life’s most important moments. Our offerings are built around four key pillars: financial well-being, family support, mental and physical health, and recognition. Highlights include an employee stock purchase plan with a 2-year look back, family support programs to help balance work and home life, robust mental health resources to prioritize emotional well-being, and recognition and service awards to celebrate contributions and milestones. We look forward to sharing more with you during the interview process.

All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, disability or protected veteran status.

Any applicant who requires a reasonable accommodation during the selection process should contact Marvell HR Helpdesk at [email protected].

Interview Integrity

To support fair and authentic hiring practices, candidates are not permitted to use AI tools (such as transcription apps, real-time answer generators like ChatGPT or Copilot, or automated note-taking bots) during interviews. These tools must not be used to record, assist with, or enhance responses in any way. Our interviews are designed to evaluate your individual experience, thought process, and communication skills in real time. Use of AI tools without prior instruction from the interviewer will result in disqualification from the hiring process.

This position may require access to technology and/or software subject to U.S. export control laws and regulations, including the Export Administration Regulations (EAR). As such, applicants must be eligible to access export-controlled information as defined under applicable law. Marvell may be required to obtain export licensing approval from the U.S. Department of Commerce and/or the U.S. Department of State. Except for U.S. citizens, lawful permanent residents, or protected individuals as defined by 8 U.S.C. 1324b(a)(3), all applicants may be subject to an export license review process prior to employment.

Recruitment fraud is a well-known way that third parties try to obtain personal information or steal money from you. Please review Marvell’s guidance on recruitment fraud to learn how you can protect yourself.

Requirements

Databricks — Spark, Delta Lake, Databricks Workflows, Unity Catalog; production-grade experience required
Snowflake — advanced SQL, data modeling, performance tuning, cost management
Python — strong engineering fundamentals; PySpark, pandas, pipeline frameworks (dbt, Airflow, or equivalent)
SQL — expert level; complex transformations, query optimization, schema design
Front-End Development — React, JavaScript/TypeScript, REST API integration, and Streamlit for rapid AI/ML application prototyping and internal tooling
Solid understanding of ML lifecycle — feature stores, training pipelines, inference data patterns
Cloud-native experience on AWS, Azure, or GCP
Data quality and observability tooling

Nice To Haves

Hands-on experience with MLflow, Feast, LangChain, or LlamaIndex
Exposure to graph databases (Neo4j, Neptune, or equivalent)
Exposure to vector databases (Pinecone, Weaviate, pgvector, or equivalent)
Experience with streaming pipelines (Kafka, Kinesis, Spark Structured Streaming)
Familiarity with LLM evaluation frameworks and dataset benchmarking

Responsibilities

Architect and deliver production-grade ELT/ETL pipelines across Databricks and Snowflake for ML training, validation, and inference workflows
Build and maintain AI-ready datasets optimized for both ML model consumption and Gen AI use cases — clean, versioned, and reproducible
Curate and structure high-quality datasets for RAG pipelines and embedding generation; design document chunking strategies, metadata schemas, and grounding data layers that directly improve retrieval accuracy and Gen AI application performance
Implement data quality frameworks and data contracts at pipeline boundaries to protect model and application integrity
Build and manage vector-ready data assets, integrating with vector stores and embedding infrastructure for Gen AI applications
Establish DataOps best practices — CI/CD for pipelines, data lineage, versioning, and cost observability across platforms
Develop Streamlit applications and React-based UIs to surface model outputs, data products, and internal AI tooling
Partner with ML Engineers, Data Scientists, and AI Engineers to translate modeling and application requirements into reliable data products
Contribute to lakehouse architecture decisions, storage optimization, and compute efficiency across the AI/ML data platform