PhD Research Intern, Physical AI in Perception

Zoox•Foster City, CA

About The Position

This internship is part of the Perception Semantics team, focused on advancing on-robot AI systems that enable machines to understand and interact with the physical world. You'll work on cutting-edge problems in vision-language-action (VLA) modeling, world modeling, spatial reasoning, and mapping, contributing to both research and real-world deployment. Projects are open-ended and research-driven, giving you the opportunity to explore new ideas, develop novel approaches, and evaluate them in realistic settings. This role is ideal for Ph.D. students interested in pushing the boundaries of computer vision and embodied AI while seeing their work translate into real-world impact.

Requirements

Currently enrolled in a Ph.D. program in Computer Science, Electrical/Computer Engineering, or a related field, with a specialization in CV, NLP, or ML
Experience in multi-modal modeling (vision, language, or planning), with a deep understanding of vision-language models, vision foundation models, flow matching, temporal modeling, and reinforcement learning techniques
Strong proficiency in PyTorch and modern transformer-based model design
Currently working towards a Ph.D. in a relevant engineering program
Good academic standing
Able to commit to a 12-week internship during one of the following summer 2026 cohorts: May 18th - August 7th, OR May 26th - August 14th, OR June 15th - September 4th
At least one previous industry internship, co-op, or project completed in a relevant area
Ability to relocate to the Bay Area, California (or Boston, Massachusetts) for the duration of the internship
Interns at Zoox may not use any proprietary information they work on in their thesis or in any work published with their university, and may not distribute it to anyone outside of Zoox

Nice To Haves

Publication record at top-tier AI conferences (CVPR, ICCV, ECCV, NeurIPS, ICLR, ICML, etc.)
Prior experience building foundation or end-to-end driving models for autonomous driving or robotics, or working deeply on LLM/VLM architectures (e.g., ViT, Flamingo, BEVFormer, RT-2, or GRPO-style policies)
Knowledge of RLHF/DPO/GRPO and of trajectory prediction for safety