Staff Cluster Infrastructure Engineer

Atoms•San Francisco, CA

$224,000 - $284,000•Onsite

About The Position

Atoms is building the machines that power the next era of progress. Over the last decade, software has transformed the digital world. But the physical world, where food is made, minerals are mined, goods are moved, and industries are run, remains far less intelligent, far less efficient, and far more constrained. We’re changing that. Atoms builds Physical AI: real-world robots for the industries that move civilization forward, starting with food, mining, and transport. Our systems are designed to understand, predict, and control the real world with precision, turning complex physical operations into something more reliable, more scalable, and more productive. This work requires more than robotics. It requires deep integration across hardware, software, AI, operations, manufacturing, and real estate. We don’t just build machines in a lab. We deploy them into real environments, operate them, learn from them, and improve them until they work at scale. We are roboticists, engineers, operators, and builders. We believe the next great technology companies will transform not only information but also the physical systems that shape everyday life. If you want to work on hard problems with real-world impact, join us.

Requirements

6+ years of experience operating GPU compute on Kubernetes (or similar orchestration), with the judgment to scale it as demand grows.
Strong programming and scripting skills in Python, Go, or similar.
Familiarity with Infrastructure-as-Code tools such as Terraform or CloudFormation.
Comfort with bare-metal Linux environments, GPU hardware, and networking.
A bias toward automation, reliability, and operating critical systems well.

Responsibilities

Manage and automate our GPU training clusters, including provisioning, bootstrapping, and lifecycle management.
Automate bare-metal bring-up so new machines come online quickly and reliably as we add capacity.
Build software abstractions that present a clean, unified interface to our training and simulation workloads.
Work at the hardware/software boundary, where speed and reliability are critical, continuously raising the bar for automation and uptime.
Run day-to-day operations: diagnose and resolve issues quickly when systems are under pressure.
Design our infrastructure to scale smoothly as we grow from a smaller cluster of machines toward a larger fleet.