Senior Software Engineer, AI Training & Infrastructure

Skild AI
San Mateo, CA
$200,000 - $300,000

About The Position

Skild AI, Inc. seeks a Senior Software Engineer, AI Training & Infrastructure in San Mateo, CA. You will be responsible for building and scaling training infrastructure and tools that support the full ML lifecycle (data preparation, training orchestration, evaluation, and deployment) for real-world robotics applications. The work spans performance, reliability, observability, and developer productivity across distributed training systems, as well as data processing for multimodal datasets, performance tuning of training jobs, and media processing/compression.

Requirements

  • Must have a master’s degree (or foreign equivalent) in Computer Science, Robotics, Engineering, or a related field and two (2) years of experience in machine learning infrastructure.
  • Must also have two (2) years of experience designing and operating distributed training pipelines at scale, including data preprocessing, orchestration, and evaluation.
  • Must have any experience with each of the following: Python or C++ and at least one deep learning library (e.g., PyTorch, TensorFlow, or JAX); and CI/CD and automated testing for ML/infra services.
  • Must have knowledge of: optimizing data loading and I/O for deep learning workloads (e.g., PyTorch DataLoader, sharding, prefetching, or caching); processing multimodal datasets and formats (e.g., HDF5, TFRecord, Parquet, or equivalent) and image processing/compression (e.g., OpenCV or ffmpeg); cloud-based training in AWS, Google Cloud, or Azure; implementing monitoring, logging, and alerting for training systems; Linux OS fundamentals and operation at large scale; distributed systems and ML training techniques/models; and core software engineering principles, including algorithms, data structures, and system design.

Responsibilities

  • Architecting, building, and maintaining distributed training pipelines and frameworks spanning data ingest/preprocessing, large-scale training, and evaluation.
  • Optimizing training performance and resource utilization by identifying bottlenecks and implementing improvements in data loading, I/O, caching, sharding, and prefetching.
  • Integrating state-of-the-art ML techniques into production training systems in collaboration with research/ML teams.
  • Implementing monitoring, logging, alerting, automated testing, and CI/CD for reliable training operations.
  • Developing developer tooling and documentation, including dashboards and utilities, to streamline experimentation at scale and improve engineer productivity.