ML Infra Engineer

Humble Robotics • San Francisco, CA

About The Position

We're looking for an ML infrastructure engineer to help design, build, and scale the foundational systems we need to realize our ambitious vision. You'll work on tooling and infrastructure that supports every stage of the ML training flywheel and be an important voice in the technical and organizational decisions that shape our work. Across vehicle compute, data collection, dataset curation, and large-scale model training and deployment, you'll help us build reliable, performant, and secure infrastructure that every team at Humble Robotics can depend on. It's fun here. We are doing cool stuff. The ideal candidate is a first-principles thinker who is comfortable being a broad generalist and working on every layer of the stack to make the software iteration loop as fast and efficient as possible. We're a small team, so your input, experience, and knowledge will play a critical role in shaping every system we build, operate, and depend on to achieve our mission.

Requirements

Experience building and operating high-availability web services on cloud infrastructure
Experience with infrastructure-as-code and configuration management tools (we use Terraform and Ansible)
Experience building and maintaining CI/CD pipelines and managing deployments
Fluency in security fundamentals, including Linux hardening, network security, and cryptographic principles
Hands-on experience with cluster scheduling systems for running large-scale batch computation
Comfortable reading, writing, and extending non-trivial code (not just scripting)
Eligible to work in the United States

Nice To Haves

Hands-on experience managing large, high-performance ML training clusters
Working knowledge of distributed training frameworks and high-performance networking for ML workloads
Prior infrastructure experience at an early-stage autonomous vehicle or robotics company
Comfort operating as an early team member—high ownership, low ego, fast iteration

Responsibilities

Work on data collection infrastructure that moves sensor data reliably and efficiently from our vehicles into our ML platform
Develop batch compute pipelines for cataloging, exploring, and curating raw data into high-quality training sets
Design and scale distributed ML training on our GPU clusters
Take ownership of performance, observability, efficiency, and security across the full pipeline
Partner with the ML team to understand their workflows and translate them into reliable infrastructure that accelerates their work