Member of Technical Staff, Infrastructure / DevOps

Plato • San Francisco, CA

About The Position

Plato is an applied research lab building the foundational infrastructure to train specialized AI agents. We turn real-world data streams into high-fidelity simulated environments that generate the training signal needed to make capable models.

Today, only a handful of players can train models for capable work. Compute and algorithms are rapidly commoditizing, but reinforcement learning data remains the bottleneck. Plato is changing that by automatically scaling training environments from proprietary real-world data. Our work supports frontier labs, hyperscalers, and enterprises building AI systems for complex, high-stakes work.

Infrastructure is central to Plato's product and research loop. Generic cloud systems are not designed for long-running RL environments, persistent agent workspaces, replayable rollouts, storage-efficient forks, or recursive debugging loops. To train useful agents, we need infrastructure that makes environment construction, experimentation, evaluation, and iteration feel like one seamless system.

As a Member of Technical Staff, Infrastructure / DevOps, you will own the systems that make Plato's research and training loops reliable at scale.

Requirements

Experience building or operating distributed systems, cloud infrastructure, orchestration platforms, or developer tooling.
Comfortable debugging across infrastructure, application, and research workflows.
Care deeply about reliability, observability, isolation, and cost efficiency.
Enjoy working with researchers and engineers to turn messy, fast-moving workflows into durable systems.
Want to build infrastructure that is part of the core product, not just internal support tooling.

Responsibilities

Build and operate purpose-built infrastructure for RL rollouts, long-running agent tasks, and environment synthesis jobs.
Scale environment VMs, snapshots, checkpointing, persistent sandboxes, and storage-efficient forks.
Design orchestration systems for fleets of agents that crawl, synthesize, evaluate, debug, and rerun experiments.
Build telemetry, logging, tracing, replay, and observability systems for thousands of concurrent agent sessions.
Improve reliability, cold starts, uptime, cost efficiency, isolation, and developer experience across the infrastructure stack.
Partner with research engineers to turn experimental workflows into repeatable, production-grade systems.
