Senior Engineering Manager, ML Platform

Boston Dynamics • Waltham, MA

$198,000 - $300,000

About The Position

We're looking for a Senior Engineering Manager to lead our ML Platform Team - a growing team responsible for the foundational infrastructure that powers our machine learning work. This is a player-coach role: you'll set technical direction and contribute hands-on while building out the team and establishing the processes that will scale with it. The platform is in its early stages, with some foundations in place. You'll be joining at a pivotal moment, making architectural decisions that will shape the platform as the team grows from 4 engineers today to 10–12.

Requirements

7–12 years of engineering experience, including at least 2 years in a formal management or tech lead capacity
Demonstrated experience building or scaling a platform, infrastructure, or ML systems team from the ground up
Technical credibility in one or more of: GPU/distributed compute infrastructure, large-scale data storage and retrieval, or data pipeline frameworks
Experience making foundational architectural decisions in an early-stage or greenfield environment
Strong cross-functional communication skills - able to translate between ML researchers, engineers, and senior leadership
Comfortable with ambiguity; able to define the roadmap rather than just execute against one
A hands-on mindset - willing and able to write code, review designs, and debug production issues alongside your team

Nice To Haves

Familiarity with compute orchestration frameworks such as Kubernetes, Slurm, or Ray
Experience with ML training workflows, dataset generation pipelines, or feature stores
Prior experience growing a team through a hiring ramp (e.g. doubling or tripling headcount)

Responsibilities

Own the strategy, roadmap, and execution for GPU compute infrastructure, ensuring it scales to meet growing model training and fine-tuning demands
Contribute directly to infrastructure design and implementation, particularly in the near term as the team grows
Drive reliability, performance, and cost efficiency across distributed training clusters
Optimize existing and new training workloads so they run efficiently at scale
Evaluate and adopt new hardware (GPUs, TPUs, custom accelerators) and cloud/on-prem infrastructure as the team's needs evolve
Oversee the design and operation of data storage, indexing, and retrieval systems that support large-scale dataset generation
Ensure data pipelines are performant, fault-tolerant, and meet the quality and freshness requirements of ML teams
Establish early-stage standards for data access, lineage, and governance - pragmatic and scalable, not over-engineered
Lead the development and maintenance of shared libraries and frameworks for data transformation pipelines
Partner with ML researchers and engineers to understand their workflows and translate them into reliable, reusable platform capabilities
Champion developer productivity - reduce friction for teams consuming platform services
Lay the architectural foundations of the platform, making decisions that are pragmatic today but designed to scale to a 10–12 person team and beyond
Make key architectural decisions around compute orchestration (e.g. Kubernetes, Slurm, Ray), storage systems, and pipeline frameworks
Balance short-term delivery with long-term platform health - knowing when to build, buy, or borrow
Act as a technical partner to ML research, data engineering, and product teams - translating needs into platform priorities
Communicate roadmap, incidents, and technical tradeoffs clearly to both engineers and senior leadership
Help ML teams become self-sufficient on the platform, reducing bottlenecks on the platform team itself
Actively participate in hiring to grow the team from 4 to ~10–12 engineers, including defining roles and leveling
Mentor and develop engineers, establishing a team culture early that will hold as headcount scales
Define lightweight but durable team processes - on-call rotations, incident response, and engineering standards that won't need to be rebuilt at scale
Be comfortable doing IC work yourself while simultaneously building the team's capacity to take it on