About The Position

As a Member of Technical Staff at Fluidstack, you will design, develop, and maintain the software that powers our AI infrastructure and enables our customers to run complex ML workloads efficiently at scale. Your responsibilities are aligned with the success of our customers and your teammates, and you'll work side by side with them to push forward the state of the art in AI/ML. A typical day's work is described in the Responsibilities section below.

Requirements

  • Developed software for training or serving large-scale ML models (1000+ GPU scale)
  • Optimized distributed training performance across multiple nodes and accelerators
  • Implemented APIs and interfaces for ML platforms that prioritize developer experience
  • Experience with orchestration systems such as Kubernetes or SLURM for large-scale ML workloads
  • Built or contributed to ML infrastructure tools (e.g., Ray, Horovod, DeepSpeed) and worked with ML experiment tracking and workflow systems (MLflow, Kubeflow, W&B)

Responsibilities

  • Developing and optimizing job scheduling systems to maximize GPU utilization and throughput for ML workloads
  • Building and improving software interfaces for cluster management that support PyTorch, JAX, and other ML frameworks
  • Creating monitoring and observability tools for tracking training progress, resource usage, and system performance
  • Implementing data pipeline optimizations to accelerate training and inference workflows
  • Designing and developing APIs and services to integrate with MLflow, Kubeflow, Weights & Biases, and other ML tooling
  • Writing libraries and utilities to simplify the deployment and management of distributed training jobs

Benefits

  • Competitive total compensation package (cash + equity)
  • Retirement or pension plan, in line with local norms
  • Health, dental, and vision insurance
  • Generous PTO policy, in line with local norms
  • Remote-first work environment with access to WeWork for remote locations