AI Agent Data Pipeline Intern

XPENG, Santa Clara, CA

About The Position

XPENG is a leading smart technology company integrating advanced AI and autonomous driving technologies into its products, including electric vehicles (EVs), electric vertical take-off and landing (eVTOL) aircraft, and robotics.

The team builds platform capabilities that support the development and deployment of Autonomous Driving AI models, working closely with Machine Learning Engineers (MLEs) to improve the efficiency, quality, and reliability of the experiment lifecycle. They are building an LLM-powered agent to assist MLEs in collecting experiment context, analyzing progress and results, and surfacing useful insights.

This internship focuses on building the data foundation for that agent: cleaning, organizing, and connecting various data sources, especially noisy chat and meeting data. The intern will develop data pipelines and LLM-assisted data cleaning workflows so the agent can correctly retrieve, interpret, and reason over experiment-related information. Depending on progress and interest, there may be opportunities to fine-tune LLM-based models on curated experiment data to improve agent performance on domain-specific tasks.

Requirements

  • Strong skills in Python, SQL, and data processing.
  • Experience working with structured and unstructured data, including text-heavy sources such as documents, notes, messages, or logs.
  • Familiarity with data pipelines, ETL workflows, or large-scale data processing.
  • Interest in LLM development, LLM evaluation, agentic AI systems, RAG pipelines, semantic retrieval, prompt engineering, or LLM-assisted data processing.
  • Familiarity with machine learning workflows, model training, evaluation metrics, or MLOps concepts.
  • Strong analytical thinking and attention to data quality, consistency, and reliability.
  • Comfort working with ambiguous data sources and collaborating with ML and platform engineers to clarify requirements.
  • Previous experience building internal tools, automation scripts, or data quality checks.

Responsibilities

  • Build pipelines to ingest and organize experiment-related data from team communications, meeting notes, experiment plans, analysis documents, metrics, and evaluation results.
  • Use LLM-based methods to clean noisy unstructured data, extract experiment-relevant information, and convert fragmented discussions into structured records.
  • Design data schemas, metadata, and quality checks that make experiment context easier to search, trace, and use in downstream agent workflows.
  • Support retrieval and indexing workflows, including semantic search or RAG-style pipelines, so the agent can access relevant experiment context.
  • Prepare curated datasets for agent evaluation and, where applicable, LLM fine-tuning or instruction-tuning.
  • Work with MLEs and platform engineers to understand experiment workflows, data gaps, and the types of insights most useful for planning and analysis.
  • Evaluate whether the agent uses curated experiment data correctly to generate summaries, comparisons, recommendations, and analysis insights.
  • Contribute to internal tools, dashboards, or reports that help teams monitor experiment status, outcomes, and trends.

Benefits

  • A fun, supportive and engaging environment.
  • Infrastructure and computational resources to support your work.
  • Opportunity to work on cutting-edge technologies with top talent in the field.
  • Opportunity to make a significant impact on the transportation revolution by advancing autonomous driving.
  • Competitive compensation package.
  • Snacks, lunches, dinners, and fun activities.