Technical Lead Manager, ML Platform Infrastructure

Nuro
Mountain View, CA
$235,030 - $352,290

About The Position

Nuro is seeking an experienced Technical Lead Manager with deep expertise in large-scale infrastructure, workload orchestration, and batch and streaming data processing systems to join our ML Infrastructure team. In this role, you will lead the evolution of our core platform, ensuring our researchers and engineers have seamless access to the compute and data resources required to build the future of autonomous driving. You will drive the strategy for automated resource provisioning, high-performance workload scheduling, and efficient feature management. As a TLM, you will balance hands-on technical leadership with people management, mentoring a high-performing team while partnering closely with ML Research and Autonomy teams to eliminate infrastructure bottlenecks and accelerate the Nuro Driver™ development lifecycle.

Requirements

  • Experience: 6+ years of professional experience in ML Infrastructure, Backend Platform Engineering, or Distributed Systems with 3+ years of people/team management experience.
  • Resource Provisioning: Deep familiarity with modern Infrastructure-as-Code and provisioning tools (e.g., Terraform, Pulumi, or Crossplane).
  • Workload Scheduling: Hands-on experience building or managing large-scale orchestrators for compute-heavy workloads (e.g., Kubernetes/KubeRay, Ray, Slurm, or Volcano).
  • Data Extraction (ETL): Proven expertise in large-scale data extraction and transformation. You must be proficient in at least one distributed processing framework, such as Apache Spark or Apache Beam.
  • Feature Management: Experience implementing or maintaining feature stores and caching layers (e.g., Feast, Hopsworks, or Redis-based custom caching).

Nice To Haves

  • Advanced degree (Ph.D. or M.Sc.) in Computer Science, Systems Engineering, or a related technical field.
  • Active contributor to open-source projects in the MLOps or Cloud-Native ecosystem (e.g., CNCF, Ray, or Kubeflow communities).
  • Experience with high-performance storage systems (e.g., Lustre, Ceph, or specialized NVMe caching) for ML data loading.
  • Knowledge of cost-optimization strategies for large-scale GPU clusters in public clouds (AWS/GCP/Azure).

Responsibilities

  • Setting Technical Strategy: Defining the roadmap for a unified ML platform that abstracts complex cloud infrastructure.
  • Resource Provisioning & IaC: Scaling our automated infrastructure-as-code (IaC) pipelines to manage thousands of GPU/CPU nodes across diverse environments.
  • Intelligent Scheduling: Designing and optimizing workload orchestration to maximize hardware utilization, minimize job wait times, and handle massive-scale distributed training.
  • Data Extraction & ETL: Designing robust pipelines for the extraction and transformation of petabyte-scale sensor and telemetry data into ML-ready formats.
  • Feature Caching & Feature Stores: Implementing robust feature caching and storage solutions to reduce redundant computations and ensure low-latency access to pre-computed features.
  • Team Leadership: Mentoring and growing a team of software and systems engineers, fostering a culture of operational excellence and technical innovation.

Benefits

  • This position is eligible for an annual performance bonus, equity, and a competitive benefits package.