- Develop and optimize vision-language-action models (VLAMs), including transformers, diffusion models, and multimodal encoders/decoders.
- Build representations for 2D/3D perception, affordances, scene understanding, and spatial reasoning.
- Integrate LLM-based reasoning with action planning and control policies.
- Design datasets for multimodal learning: video-action trajectories, instruction following, teleoperation data, and synthetic data.
- Interface VLAM outputs with real-time robot control stacks (navigation, manipulation, locomotion).
- Implement grounding layers that convert natural language instructions into symbolic, geometric, or skill-level action plans.
- Deploy models on onboard or edge compute platforms, optimizing for latency, safety, and reliability.
- Build scalable pipelines for ingesting, labeling, and generating multimodal training data.
- Create simulation-to-real (Sim2Real) training workflows using synthetic environments and teleoperated demonstration data.
- Optimize training pipelines, model parallelism, and evaluation frameworks.
- Work closely with robotics, hardware, controls, and safety teams to ensure model outputs are executable, safe, and predictable.
- Collaborate with product teams to define robot capabilities and user-facing behaviors.
- Participate in user and field testing to iterate on real-world performance.
Job Type
Full-time
Career Level
Mid Level