Research Engineer, Multimodal Data

Eventual
San Francisco, CA
Onsite

About The Position

Eventual is building a video-native index on top of Daft, its open-source distributed data engine purpose-built for multimodal AI. The technology aims to accelerate the iteration loop for training AI models, particularly in Physical AI domains like robotics and autonomous vehicles. The company has raised $30M and is backed by prominent investors. Eventual is looking for a Research Engineer to join its Visual Understanding team, focused on making petabytes of video data queryable by content. This role involves defining the roadmap for visual understanding capabilities, training and selecting models for large-scale annotation, and building datasets for customer models. The work is research-oriented but requires shipping to production and demonstrating impact on customer training runs.

Requirements

  • Strong familiarity with modern vision and multimodal models — convolutional nets, VLMs, VQA, embeddings — and a sense for the SOTA that's actually deployable today vs. on a leaderboard.
  • Experience running these models at scale on real video and sensor data, ideally for perception tasks (detection, tracking, segmentation, retrieval, captioning).
  • Background from a perception team at a self-driving, robotics, or visual-data company — or equivalent depth from a research lab.
  • Comfortable with cloud infrastructure and large-scale data processing — you don't need to be a distributed-systems engineer, but you've shipped jobs that ran on thousands of GPU-hours of video.
  • Bias toward data and infrastructure: you reach for "annotate the whole corpus" before "fine-tune another model."

Nice To Haves

  • Experience training vision or multimodal models from scratch (not just calling APIs).
  • ML/AI research background — papers, citations, or a research org on your resume.
  • Hands-on time with big-data frameworks like Spark, Ray, or Daft.
  • Worked on embeddings, retrieval, or content-aware search at scale.
  • Experience designing labeling taxonomies or running annotation programs.

Responsibilities

  • Own the visual understanding roadmap end-to-end: from picking the model family for a customer's taxonomy to landing it in production inference at corpus scale.
  • Train, fine-tune, and evaluate VLMs, VQA models, embedding models, and convolutional perception models against customer datasets and benchmarks.
  • Drive down per-clip annotation cost — model selection, distillation, batching, decode pipelining — so "annotate every clip in a 10K-hour corpus" stays economical.
  • Build the rich, queryable datasets that customers train on: design taxonomies with researchers, instrument quality, version the outputs.
  • Partner with the dataloading and storage teams so visual understanding outputs flow into the index and on to the GPU without re-engineering.
  • Work directly with researchers at our partner labs — your shortest feedback loop is their next training iteration.

Benefits

  • Competitive comp and meaningful startup equity.
  • Catered lunches and dinners for SF employees.
  • Commuter benefits.
  • Team-building events and poker nights.
  • Health, vision, and dental coverage.
  • Flexible PTO.
  • Latest Apple equipment.
  • 401(k) plan with match.