ML Infrastructure Engineer

Echo Neurotechnologies
San Francisco, CA

About The Position

We are seeking a Senior Machine Learning Infrastructure Engineer to join our team. The person in this role will design, build, and scale the infrastructure that powers massive-scale data, modeling, and analysis platforms. They will play a critical role in shaping a high-performance, production-grade ML ecosystem that supports rapid experimentation with diverse datasets spanning neural signals, behavior, and more. This person will have significant ownership over the ML R&D platform, working closely with domain experts to architect new cloud infrastructure, data pipelines, and modeling flows. Ultimately, the work will enable the development of cutting-edge models for neuroscientific discovery and neural decoding, empowering brain-computer interface technology to improve the lives of patients living with severe neurological conditions.

Requirements

  • Bachelor's degree in Computer Science, Electrical Engineering, or a related technical discipline
  • 5+ years of industry experience in software engineering, large-scale data infrastructure, or systems ML
  • Extensive proficiency in Python
  • Familiarity with PyTorch
  • Experience designing, building, and maintaining high-throughput data pipelines for large and diverse datasets
  • Experience working with distributed-training frameworks (e.g., FSDP, DeepSpeed, Megatron-LM, or Ray)
  • Experience building or optimizing ML training pipelines for transformers or other large neural-network models
  • Demonstrated ability to partner closely with research and modeling teams to productionize workflows
  • Excellent communication and collaboration skills to work effectively on cross-functional and interdisciplinary teams
  • Experience holding technical ownership of at least one successfully delivered collaborative project

Nice To Haves

  • Advanced degree (MS or PhD) in Computer Science, Electrical Engineering, or a related technical discipline
  • Proficiency in C++, Go, CUDA, Rust, and/or Java
  • Experience in data engineering and systems ML for time-series data
  • Deep understanding of the fundamentals of distributed systems, including scalability, fault tolerance, monitoring, observability, scheduling, performance tuning, and resource management
  • Experience with cloud-native environments and orchestration (Kubernetes, Docker, etc.)
  • Experience scaling foundation-model training infrastructure or multi-cluster computing environments

Responsibilities

  • Create flexible and performant ML infrastructure
      • Design and build systems ML cloud infrastructure to enable massive-scale modeling and analytics
      • Support diverse model exploration, hyperparameter optimization, pretraining, fine-tuning, and evaluation processes
      • Design and optimize scalable distributed training pipelines, with support for features such as model sharding, cross-GPU communication, and real-time training monitoring
      • Create, operate, and maintain robust ML platforms and services across the model lifecycle
      • Make informed architecture decisions that balance performance, cost, reliability, and scalability
  • Build diverse and scalable data platforms
      • Design, build, and optimize massive-scale databases and data pipelines for scalable, flexible, and reliable data access
      • Explore research-driven, tailored data solutions using existing and simulated data, comparing performance and efficiency across solutions for typical data-access patterns
      • Create infrastructure and pipelines for ingesting internal and external datasets with varied shapes, formats, and associated metadata
      • Design and assess custom data formats for efficient storage and slicing of high-dimensional time-series data
      • Enable efficient data movement, preprocessing, and artifact management for data lineage and modeling reproducibility
  • Meet company standards for delivered solutions
      • Establish best practices for reliability, observability, reproducibility, and operational excellence across the ML ecosystem
      • Make informed and collaborative decisions with domain experts across the software and ML teams
      • Foster visibility and reproducibility within the company by maintaining extensive documentation of design decisions, evaluations of viable alternatives for selected solutions, and pipeline assessments
      • Support ML R&D operations while preparing for eventual incorporation into product pipelines

Benefits

  • An opportunity to work on exciting, cutting-edge projects to transform patients’ lives in a highly collaborative work environment.
  • Competitive compensation, including stock options.
  • Comprehensive benefits package.
  • 401(k) program with matching contributions.