Staff Software Engineer, ML Tooling and Infrastructure

Boston Dynamics • Waltham, MA

52d • $155,000 - $230,000

About The Position

As a Staff Software Engineer on the Atlas team, you will be a critical engineering pillar for a world-class group of engineers and scientists creating the next generation of humanoid robotics. Our team is pushing the boundaries of Large Behavior Models, and your role is to build the robust, scalable, and efficient software foundation that accelerates our development cycles. This is a hands-on software engineering role on a fast-paced applied AI team. Your mission is to build the tooling, pipelines, and infrastructure that bridge the gap between experimental prototypes and production-grade solutions deployed on our robots. You will have high autonomy to tackle a variety of complex engineering challenges, and your work will have a direct and immediate impact on the capabilities of the Atlas robot.

Requirements

6+ years of professional experience designing, building, and maintaining production Python applications.
Proven experience deploying and optimizing neural network models in production or real-world environments.
Deep expertise with modern software development practices: build systems (like Bazel or Pants), monorepos, Docker, and Python packaging.
Strong familiarity with the ML ecosystem, including PyTorch, ONNX, and inference servers like NVIDIA Triton.
Hands-on experience implementing distributed (multi-GPU, multi-node) training on a compute cluster.
Proficiency with production-grade database systems (e.g., PostgreSQL), ORMs, and data orchestration tools (e.g., Airflow).

Nice To Haves

Experience in robotics, behavior learning, or computer vision (e.g., vision-language models).
Familiarity with modern C++.
Experience with front-end or web development for building internal tools (e.g., React, Vue).

Responsibilities

Architect and Refactor: Take ownership of our Python-based training and inference infrastructure, relentlessly improving its quality, performance, and scalability.
Build with Quality: Implement comprehensive testing, champion best practices for code quality, and build automated CI/CD pipelines to ensure reliable deployment and validation.
Own MLOps: Design, build, and operate the MLOps infrastructure for our cutting-edge behavior models, focusing on reliability, reproducibility, and speed from training to deployment.
Enable Data Insights: Develop tools and dashboards for data collection, analysis, and visualization, empowering the team to make data-driven decisions.
Manage Data Flow: Design and maintain scalable data pipelines for ingesting, processing, and versioning massive datasets from our robotics fleet.
Optimize Performance: Improve and maintain tooling for both on-robot and off-robot model inference, focusing on latency, throughput, and efficiency.
Collaborate and Scale: Partner with central infrastructure teams to optimize shared resources (e.g., compute clusters) and drive improvements that benefit the entire organization.
