Research Engineer, Multimodal Data

Eventual
San Francisco, CA
Hybrid

About The Position

Eventual is building a video-native index on top of its open-source engine, Daft, which is purpose-built for multimodal AI. The index aims to accelerate the iteration loop for Physical AI teams by letting them describe the dataset they want and receive a curated table in minutes, ready to feed GPUs at line rate. The company has raised $30M and has a world-class team from companies like AWS, Render, Pinecone, and Tesla. They are looking for people passionate about powering the next generation of Physical AI.

Requirements

  • Strong familiarity with modern vision and multimodal models — convolutional nets, VLMs, VQA, embeddings — and a sense for the SOTA that's actually deployable today vs. on a leaderboard.
  • Experience running these models at scale on real video and sensor data, ideally for perception tasks (detection, tracking, segmentation, retrieval, captioning).
  • Background from a perception team at a self-driving, robotics, or visual-data company — or equivalent depth from a research lab.
  • Comfortable with cloud infrastructure and large-scale data processing — you don't need to be a distributed-systems engineer, but you've shipped jobs that ran on thousands of GPU-hours of video.
  • Bias toward data and infrastructure: you reach for "annotate the whole corpus" before "fine-tune another model."

Nice To Haves

  • Experience training vision or multimodal models from scratch (not just calling APIs).
  • ML/AI research background — papers, citations, or a research org on your resume.
  • Hands-on time with big-data frameworks like Spark, Ray, or Daft.
  • Worked on embeddings, retrieval, or content-aware search at scale.
  • Experience designing labeling taxonomies or running annotation programs.

Responsibilities

  • Own the visual understanding roadmap end-to-end: from picking the model family for a customer's taxonomy to landing it in production inference at corpus scale.
  • Train, fine-tune, and evaluate VLMs, VQA models, embedding models, and convolutional perception models against customer datasets and benchmarks.
  • Drive down per-clip annotation cost — model selection, distillation, batching, decode pipelining — so "annotate every clip in a 10K-hour corpus" stays economical.
  • Build the rich, queryable datasets that customers train on: design taxonomies with researchers, instrument quality, version the outputs.
  • Partner with the dataloading and storage teams so visual understanding outputs flow into the index and on to the GPU without re-engineering.
  • Work directly with researchers at our partner labs — your shortest feedback loop is their next training iteration.

Benefits

  • Competitive comp and meaningful startup equity.
  • Catered lunches and dinners for SF employees.
  • Commuter benefits.
  • Team-building events and poker nights.
  • Health, vision, and dental coverage.
  • Flexible PTO.
  • Latest Apple equipment.
  • 401(k) plan with match.