We are looking for a graduate student interested in advancing the field of multimodal scene understanding, focusing on scene understanding using natural language for robot dialog and/or indoor monitoring with a large language model. The intern will collaborate with MERL researchers to derive and implement new models and optimization methods, conduct experiments, and prepare results for publication. Internships regularly lead to one or more publications in top-tier venues, which can later become part of the intern's doctoral work. The ideal candidates are senior Ph.D. students with experience in deep learning for audio-visual, signal, and natural language processing. Good programming skills in Python and knowledge of deep learning frameworks such as PyTorch are essential. Multiple positions are available with a flexible start date (not just Spring/Summer but throughout 2026) and duration (typically 3-6 months).
Stand Out From the Crowd
Upload your resume and get instant feedback on how well it matches this job.
Career Level
Intern
Education Level
Ph.D. or professional degree
Number of Employees
5,001-10,000 employees