ML Infra Engineer - Platform

Physical Intelligence · San Francisco, CA

About The Position

Physical Intelligence is bringing general-purpose AI into the physical world. We are a team of engineers, scientists, roboticists, and company builders developing foundation models and learning algorithms to power the robots of today and the physically actuated devices of the future.

The Team

The Infrastructure team builds and operates the backbone of everything PI does: from training state-of-the-art VLA models, to orchestrating large-scale simulation, to reliably deploying intelligence across fleets of physical robots. The team works closely with researchers, robotics runtime, product, and platform engineers to ensure infrastructure scales from prototype to production-grade deployments.

In This Role You Will

  • Own core cloud platform infrastructure: You will design, build, and operate CPU compute platforms (such as Anyscale and similar systems), including cluster lifecycle management, capacity planning, quotas, and scheduling primitives. A key goal is making it straightforward and predictable to bring up new clusters and environments as needs evolve.
  • Build and scale platform systems: You will operate and evolve Kubernetes clusters and service deployment patterns, and help build a scalable microservice platform for internal systems such as evaluation services, operational tooling, and internal APIs. This includes supporting safe rollouts, upgrades, and rollback strategies.
  • Own workflow orchestration infrastructure: You will take platform-level ownership of async and multi-stage workflows, ensuring they are reliable, observable, and easy to extend. These workflows power large-scale evaluation, data processing, and long-running infrastructure tasks.
  • Drive observability and cost-aware infrastructure: You will treat logging, metrics, tracing, and alerting as first-class platform primitives, and build systems that surface reliability and performance issues early. You will also help improve cost visibility and enable cost-aware decision-making at the infrastructure level.
  • Harden cloud foundations: You will own cloud-first infrastructure with multi-cloud considerations, designing networking, DNS, quotas, and cloud primitives that behave predictably. A major part of this work is reducing infra churn by standardizing patterns, abstractions, and interfaces.
  • Improve developer experience: You will build clear, documented interfaces for using platform infrastructure, reducing the gap between “I need infra” and “I can run my workload.” This includes supporting consistent local vs. remote development workflows and improving self-serve infrastructure usage.
  • Collaborate and lead through ownership: You will work closely with researchers and infra peers to understand requirements and constraints, translate fast-moving needs into reusable infrastructure, and own systems end-to-end, from design through operation.

Requirements

  • Deep experience with cloud platforms (GCP, AWS) and distributed systems: compute orchestration, networking, autoscaling, service meshes, load balancing.
  • Ability to reason about system bottlenecks, performance tuning, and cost optimizations across compute, networking, and storage.
  • Comfort with Kubernetes, cluster-level reliability, and service-oriented architectures.
  • Solid intuition around scalability, performance, and failure modes.
  • Experience with infrastructure-as-code (e.g., Terraform), containerization, and modern platform engineering practices.
  • Familiarity with logging, metrics, tracing, incident response, SLOs, and debugging complex distributed systems.
  • Ability to take full ownership of systems and operate them in production.
  • Strong cross-functional communication and ownership mindset.
  • 2-5 years of experience in fast-moving or early-stage environments where ambiguity is normal, with a demonstrated growth trajectory.

Nice To Haves

  • Experience with large-scale ML training, evaluation, or simulation infrastructure.
  • Familiarity with workflow orchestration systems (e.g., Temporal).
  • Experience with secrets management systems (e.g., Doppler).
  • Experience designing shared compute platforms or quota systems.
  • Background in observability, cost optimization, or internal platform tooling.
  • Exposure to robotics, simulation, or real-time systems.
