- Develop and optimize vision-language-action models (VLAMs), including transformers, diffusion models, and multimodal encoders/decoders.
- Build representations for 2D/3D perception, affordances, scene understanding, and spatial reasoning.
- Integrate LLM-based reasoning with action planning and control policies.
- Design datasets for multimodal learning: video-action trajectories, instruction following, teleoperation data, and synthetic data.
- Interface VLAM outputs with real-time robot control stacks (navigation, manipulation, locomotion).
- Implement grounding layers that convert natural language instructions into symbolic, geometric, or skill-level action plans.
- Deploy models on onboard or edge compute platforms, optimizing for latency, safety, and reliability.
- Build scalable pipelines for ingesting, labeling, and generating multimodal training data.
- Create simulation-to-real (Sim2Real) training workflows using synthetic environments and teleoperated demonstration data.
- Optimize training pipelines, model parallelism, and evaluation frameworks.
- Work closely with robotics, hardware, controls, and safety teams to ensure model outputs are executable, safe, and predictable.
- Collaborate with product teams to define robot capabilities and user-facing behaviors.
- Participate in user and field testing to iterate on real-world performance.
Job Type
Full-time
Career Level
Mid Level