About The Position

As a Member of Technical Staff at Fluidstack, you will design, develop, and maintain the software that powers our AI infrastructure and enables our customers to run complex ML workloads efficiently at scale. Your responsibilities are aligned with the success of our customers and your teammates, and you'll work side by side with them to push forward the state of the art in AI/ML. A typical day's work is described in the Responsibilities section below.

Requirements

  • Developed software for training or serving large-scale ML models (1000+ GPU scale)
  • Optimized distributed training performance across multiple nodes and accelerators
  • Implemented APIs and interfaces for ML platforms that prioritize developer experience
  • Experience with orchestration systems such as Kubernetes or SLURM for large-scale ML workloads
  • Built or contributed to ML infrastructure tools (e.g., Ray, Horovod, DeepSpeed) and worked with ML experiment tracking and workflow systems (MLflow, Kubeflow, W&B)

Responsibilities

  • Developing and optimizing job scheduling systems to maximize GPU utilization and throughput for ML workloads
  • Building and improving software interfaces for cluster management that support PyTorch, JAX, and other ML frameworks
  • Creating monitoring and observability tools for tracking training progress, resource usage, and system performance
  • Implementing data pipeline optimizations to accelerate training and inference workflows
  • Designing and developing APIs and services to integrate with MLflow, Kubeflow, Weights & Biases, and other ML tooling
  • Writing libraries and utilities to simplify the deployment and management of distributed training jobs

Benefits

  • Competitive total compensation package (cash + equity)
  • Retirement or pension plan, in line with local norms
  • Health, dental, and vision insurance
  • Generous PTO policy, in line with local norms
  • Remote-first work environment with access to WeWork for remote locations