Computer Vision Engineering Intern - Fall 2026

Intuitive • Sunnyvale, CA

$62 - $82 • Onsite

About The Position

The candidate will join a leading R&D team advancing cutting-edge computer vision for robotic endoscopic video technologies. The focus will be on vision foundation and diffusion models, feature detection, and multimodal video analysis, contributing to next-generation AI platforms for real-world applications. We are seeking a talented individual who is passionate about the latest advancements in computer vision and deep learning. Expected contributions include literature research, algorithm development and implementation, and experimental evaluation on large-scale video and image datasets.

Requirements

Solid understanding and hands-on experience in computer vision, deep learning, and video analysis.
Knowledge of one or more of the following areas: large vision-language models, generative diffusion models, feature detection, scene understanding, video classification, or multimodal learning.
Proficiency in programming with Python or C++, with experience in relevant frameworks (e.g., PyTorch, OpenCV, DINO/CLIP, HuggingFace Transformers).
Strong research and communication skills, with the ability to summarize findings and present them clearly.
Passion for pushing the boundaries of AI technologies to solve complex, real-world problems.
Passion for developing technologies to improve the lives of patients and physicians.
Self-driven, with the ability to work independently and deliver rapid prototypes and experiments.
Ability to iterate quickly on prototypes and think outside the box to solve practical problems.
Must be currently enrolled in and returning to an accredited degree-seeking academic program in the Spring of 2027.
Must be available to work full-time (approximately 40 hours per week) during a 10-12 week period starting August or September 2026.

Responsibilities

Explore and experiment with state-of-the-art computer vision models, including foundation models and generative diffusion models, with applications to video understanding, multimodal data, and visual feature extraction.
Prototype novel algorithms and evaluate performance using public and proprietary datasets.
Conduct literature surveys and summarize key findings in reports and presentations.