ML Infrastructure Engineer

Echo Neurotechnologies • San Francisco, CA

About The Position

We are seeking a Senior Machine Learning Infrastructure Engineer to join our team. The person who fills this role will design, build, and scale infrastructure to power massive-scale data, modeling, and analysis platforms, playing a critical role in shaping a high-performance, production-grade ML ecosystem to support rapid experimentation with diverse datasets spanning neural signals, behavior, and more. This person will have significant ownership over the ML R&D platform, working closely with domain experts to architect new cloud infrastructure, data pipelines, and modeling flows. The work will ultimately enable the development of cutting-edge models for neuroscientific discovery and neural decoding, empowering brain-computer interface technology to improve the lives of patients living with severe neurological conditions.

Requirements

Bachelor's degree in Computer Science, Electrical Engineering, or a related technical discipline
5+ years of industry experience in software engineering, large-scale data infrastructure, or systems ML
Extensive proficiency in Python
Familiarity with PyTorch
Experience designing, building, and maintaining high-throughput data pipelines for large and diverse datasets
Experience working with distributed-training frameworks (e.g., FSDP, DeepSpeed, Megatron-LM, Ray)
Experience building or optimizing ML training pipelines for transformers or other large neural-network models
Demonstrated ability to partner closely with research and modeling teams to productionize workflows
Excellent communication and collaboration skills to work effectively on cross-functional and interdisciplinary teams
Technical ownership of at least one successfully implemented collaborative project

Nice To Haves

Advanced degree (MS or PhD) in Computer Science, Electrical Engineering, or a related technical discipline
Proficiency in C++, Go, CUDA, Rust, and/or Java
Experience in data engineering and systems ML for time-series data
Deep understanding of the fundamentals of distributed systems, including scalability, fault tolerance, monitoring, observability, scheduling, performance tuning, and resource management
Experience with cloud-native environments and orchestration (Kubernetes, Docker, etc.)
Experience scaling foundation-model training infrastructure or multi-cluster computing environments

Responsibilities

Create flexible and performant ML infrastructure
Design and build systems ML cloud infrastructure to enable massive-scale modeling and analytics
Support diverse model exploration, hyperparameter optimization, pretraining, fine-tuning, and evaluation processes
Design and optimize scalable distributed training pipelines, with support for features such as model sharding, cross-GPU communication, and real-time training monitoring
Create, operate, and maintain robust ML platforms and services across the model lifecycle
Make informed architecture decisions that balance performance, cost, reliability, and scalability
Build diverse and scalable data platforms
Design, build, and optimize massive-scale databases and data pipelines for scalable, flexible, and reliable data access
Explore research-driven, tailored data solutions using existing and simulated data, comparing performance and efficiency across solutions for typical data-access patterns
Create infrastructure and pipelines for ingesting internal and external datasets with varied shapes, formats, and associated metadata
Design and assess custom data formats for efficient storage and slicing of high-dimensional time-series data
Enable efficient data movement, preprocessing, and artifact management for data lineage and modeling reproducibility
Meet company standards for delivered solutions
Establish best practices for reliability, observability, reproducibility, and operational excellence across the ML ecosystem
Make informed and collaborative decisions with domain experts across the software & ML teams
Foster visibility and reproducibility within the company by maintaining extensive documentation of design decisions, evaluations of viable alternatives for selected solutions, pipeline assessments, etc.
Support ML R&D operations while preparing for eventual incorporation into product pipelines

Benefits

An opportunity to work on exciting, cutting-edge projects to transform patients’ lives in a highly collaborative work environment.
Competitive compensation, including stock options.
Comprehensive benefits package.
401(k) program with matching contributions.
