This role owns the pipelines and storage systems that feed petabyte-scale multimodal datasets into model training. It involves building automated, efficient tooling that processes data at scale across 10,000s of CPUs and 100s of GPUs, while remaining flexible enough to handle many small, heterogeneous datasets and ad hoc analytics queries. The role also includes evolving data formats, storage, and processing to keep pace with cutting-edge AI advances and scaling the data infrastructure for future growth.
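For illustration only (a minimal sketch, not part of the posting): one common shape for this kind of pipeline is shard-level parallelism, where each worker independently processes one packed shard of records. The Python below assumes a hypothetical directory of tar-packed shards at /data/shards and a placeholder process_shard transform; at the scale described, the local process pool would be replaced by a distributed scheduler spanning the cluster.

    import concurrent.futures
    import pathlib

    def process_shard(shard_path: pathlib.Path) -> int:
        """Hypothetical per-shard transform: load, decode, filter, re-encode."""
        raw = shard_path.read_bytes()  # load the raw shard
        # ... decode multimodal records, filter, and re-encode here ...
        return len(raw)                # bytes seen, as a stand-in metric

    def run_pipeline(shard_dir: str, workers: int = 8) -> int:
        shards = sorted(pathlib.Path(shard_dir).glob("*.tar"))
        total = 0
        # Fan shard work out across a process pool; shards are independent,
        # so this parallelism scales horizontally with more workers.
        with concurrent.futures.ProcessPoolExecutor(max_workers=workers) as pool:
            for nbytes in pool.map(process_shard, shards):
                total += nbytes
        return total

    if __name__ == "__main__":
        print(run_pipeline("/data/shards"))

Because shards are processed independently, the same loop body can be reused for small one-off datasets simply by pointing it at a different directory, which is one way a single toolchain can serve both petabyte-scale jobs and ad hoc analytics.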
Job Type: Full-time
Career Level: Senior
Education Level: No Education Listed
Number of Employees: 1-10 employees