ML Infra Engineer (Data Systems)

Physical Intelligence
San Francisco, CA

About The Position

As an ML Infra Engineer (Data Systems), you’ll build and operate the data infrastructure that powers large-scale robot learning. Your systems will sit directly between raw data sources and training/evaluation, enabling us to move faster while maintaining performance, correctness, and reliability at scale. This is a systems role at the intersection of distributed systems, storage, and machine learning infrastructure.

The Team

The Infrastructure organization builds the foundations that make large-scale learning possible at PI. This includes training systems, data platforms, evaluation pipelines, and the tooling that allows researchers and roboticists to work with massive datasets safely and efficiently.

Requirements

  • Strong software engineering fundamentals.
  • Experience building distributed systems or large-scale data pipelines.
  • Comfort reasoning about performance, memory, I/O, and storage efficiency.
  • Familiarity with batch and/or streaming processing systems.
  • Experience with object storage systems and data format tradeoffs.
  • Ownership mindset: design, build, operate, and iterate on systems end-to-end.
  • Enthusiasm for working closely with researchers and unblocking fast-moving projects.

Nice To Haves

  • Experience with large ML training pipelines or dataloading systems.
  • Knowledge of columnar or custom data formats.
  • Experience with systems like ClickHouse, Ray, Flink, Spark, or similar.
  • Hands-on experience operating petabyte-scale datasets.
  • Experience debugging and fixing performance bottlenecks in data-heavy systems.

Responsibilities

  • Data Ingestion & Processing: Design and build high-throughput pipelines that validate, transform, and featurize raw multimodal data.
  • Batch & Streaming Systems: Operate large-scale batch and streaming workflows over massive datasets.
  • Storage Systems: Design object storage layouts, metadata systems, and efficient access patterns; choose file formats with performance and scalability in mind.
  • Data Lifecycle Management: Build systems for backfills, dataset rebuilds, garbage collection, and large-scale transformations.
  • Training-Time Performance: Optimize dataloaders, sharding, prefetching, caching, and throughput to reduce the time from data arrival to model training.
  • Metadata & Indexing: Build scalable metadata stores for datasets, annotations, and training artifacts.
  • Data Movement: Move hundreds of terabytes to petabytes efficiently across clusters and environments.
  • Operational Correctness: Implement observability, validation, and guardrails to prevent silent data regressions.
  • Cross-Functional Collaboration: Work closely with researchers, engineers, and roboticists to translate evolving data needs into robust systems.