Member of Technical Staff - ML Infrastructure Engineer

Black Forest Labs
San Francisco, CA (Hybrid)

About The Position

About Black Forest Labs

We're the team behind Latent Diffusion, Stable Diffusion, and FLUX: foundational technologies that changed how the world creates images and video. Our generative models power tools used by millions of creators, developers, and businesses worldwide. Our FLUX models are among the most advanced in the world, and we're just getting started. Headquartered in Freiburg, Germany, with a growing presence in San Francisco, we're scaling fast while staying true to what makes us different: research excellence, open science, and building technology that expands human creativity.

Why This Role

You'll design, deploy, and maintain the ML infrastructure backbone that makes frontier AI research possible. This isn't abstract systems work: every decision you make directly impacts whether a multi-week training run succeeds, whether inference stays fast enough for production, and whether researchers can iterate quickly or wait hours for resources.

Requirements

  • You've built and managed ML infrastructure at scale and understand that supporting AI research is fundamentally different from traditional cloud infrastructure.
  • You've been paged because a training run failed.
  • You've debugged why storage became the bottleneck.
  • You know the difference between infrastructure that works in demos and infrastructure that works when researchers depend on it for months-long experiments.
  • Strong proficiency in cloud platforms (AWS, Azure, or GCP) with focus on ML/AI services—you know which services matter and which are marketing
  • Extensive experience with Kubernetes and Slurm cluster management in production environments
  • Expertise in Infrastructure as Code tools (Terraform, Ansible, etc.) and the discipline to actually use them
  • Proven track record managing and optimizing network-based cloud file systems and object storage for ML workloads
  • Experience with CI/CD tools and practices (CircleCI, GitHub Actions, ArgoCD, etc.) in ML contexts
  • Strong understanding of security principles and best practices in cloud environments—without making security the enemy of velocity
  • Experience with monitoring and observability tools (Prometheus, Grafana, Loki, etc.) that help you understand what's actually happening
  • Familiarity with ML workflows and GPU infrastructure management—you understand what researchers need
  • Demonstrated ability to handle complex migrations and breaking changes in production environments without losing data or breaking experiments

Nice To Haves

  • Experience building custom autoscaling solutions for ML workloads that standard tools can't handle
  • Knowledge of cost-optimization strategies for cloud-based ML infrastructure (because GPU hours add up)
  • Familiarity with MLOps practices and tools
  • Experience with high-performance computing (HPC) environments
  • An understanding of data versioning and experiment tracking for ML
  • Knowledge of network optimization techniques for distributed ML training
  • Experience with multi-cloud or hybrid cloud architectures
  • Familiarity with container security and vulnerability scanning tools

Responsibilities

  • Designs, deploys, and maintains cloud-based ML training clusters (Slurm) and inference clusters (Kubernetes) that researchers and products depend on
  • Implements and manages network-based cloud file systems and blob/S3 storage solutions optimized for ML workloads at scale
  • Develops and maintains Infrastructure as Code (IaC) for resource provisioning—because manual configuration doesn't scale and configuration drift breaks things
  • Implements and optimizes CI/CD pipelines for ML workflows, making it easy for researchers to go from experiment to production
  • Designs and implements custom autoscaling solutions for ML workloads where standard approaches fall short
  • Ensures security best practices across the ML infrastructure stack without creating friction that slows down research
  • Provides developer-friendly tools and practices that make ML operations efficient—because infrastructure that's hard to use doesn't get used

Benefits

  • We'll cover reasonable travel costs.