About The Position

This role focuses on owning the pipelines and storage systems that feed petabyte-scale multimodal datasets into model training. It involves building automated, efficient tooling and systems for processing data at scale: high-performance, multimodal data pipelines capable of handling petabyte-scale datasets on tens of thousands of CPUs and hundreds of GPUs, while evolving data formats, storage, and processing to keep pace with cutting-edge AI advances and to scale the data infrastructure for future growth. The platform must also remain flexible enough to handle many small, heterogeneous datasets and ad hoc analytics queries.

Requirements

  • Knowledge of Python ETL pipelines and supporting infrastructure, data formats, and storage systems at scale.
  • Experience managing datasets, annotations, and data versioning for model training.
  • Solid grasp of ML fundamentals, essential for collaborating effectively with researchers and making sound data platform decisions.
  • Skilled at writing high-quality specifications for AI agents, while maintaining effective human review of AI-generated work.

Nice To Haves

  • High agency and ownership: proactively picks up new work according to priority, manages their own backlog, and escalates early when priorities are unclear or deadlines are at risk.
  • Takes responsibility for validating inputs end-to-end: spot-checks data, understands upstream preprocessing, and speaks up when something doesn't add up.
  • Takes responsibility for ensuring outputs are correct and handed over: actively seeks sign-off from downstream consumers, communicates caveats, and ensures relevant stakeholders are aware of changes and breaking impacts.
  • Cares about continuously improving pipelines, tooling, and processes so that each iteration makes the next one faster, more reliable, and easier for the team.
  • Comfortable with rapid, pragmatic solutions when needed, but committed to building high-quality, long-term systems.

Responsibilities

  • Design, automate, maintain, and optimize Python ETL pipelines (Spark/Ray) for large-scale multimodal data.
  • Build and maintain data cataloging, lineage, quality tooling, integrity verification, access controls, and lifecycle management systems.
  • Provide guidance, internal tools, and documentation to colleagues on data best practices.
  • Serve as a custodian of the company’s datasets, ensuring overall data health, quality, and discoverability.


What This Job Offers

  • Job Type: Full-time
  • Career Level: Senior
  • Education Level: No education listed
  • Number of Employees: 1-10 employees
