ML Infra Engineer (Data Systems)

Physical Intelligence
San Francisco, CA

About The Position

As an ML Infra Engineer (Data Systems), you’ll build and operate the data infrastructure that powers large-scale robot learning. Your systems will sit directly between raw data sources and training/evaluation, enabling us to move faster while maintaining performance, correctness, and reliability at scale. This is a systems role at the intersection of distributed systems, storage, and machine learning infrastructure.

The Team

The Infrastructure organization builds the foundations that make large-scale learning possible at PI. This includes training systems, data platforms, evaluation pipelines, and the tooling that allows researchers and roboticists to work with massive datasets safely and efficiently.

Requirements

  • Strong software engineering fundamentals.
  • Experience building distributed systems or large-scale data pipelines.
  • Comfort reasoning about performance, memory, I/O, and storage efficiency.
  • Familiarity with batch and/or streaming processing systems.
  • Experience with object storage systems and data format tradeoffs.
  • Ownership mindset: design, build, operate, and iterate on systems end-to-end.
  • Enthusiasm for working closely with researchers and unblocking fast-moving projects.

Nice To Haves

  • Experience with large ML training pipelines or dataloading systems.
  • Knowledge of columnar or custom data formats.
  • Experience with systems like ClickHouse, Ray, Flink, Spark, or similar.
  • Hands-on experience operating petabyte-scale datasets.
  • Experience debugging and fixing performance bottlenecks in data-heavy systems.

Responsibilities

  • Data Ingestion & Processing: Design and build high-throughput pipelines that validate, transform, and featurize raw multimodal data.
  • Batch & Streaming Systems: Operate large-scale batch and streaming workflows over massive datasets.
  • Storage Systems: Design object storage layouts, metadata systems, and efficient access patterns; choose file formats with performance and scalability in mind.
  • Data Lifecycle Management: Build systems for backfills, dataset rebuilds, garbage collection, and large-scale transformations.
  • Training-Time Performance: Optimize dataloaders, sharding, prefetching, caching, and throughput to reduce the time from data arrival to model training.
  • Metadata & Indexing: Build scalable metadata stores for datasets, annotations, and training artifacts.
  • Data Movement: Move hundreds of terabytes to petabytes efficiently across clusters and environments.
  • Operational Correctness: Implement observability, validation, and guardrails to prevent silent data regressions.
  • Cross-Functional Collaboration: Work closely with researchers, engineers, and roboticists to translate evolving data needs into robust systems.