Member of Technical Staff - ML Infrastructure Engineer

Black Forest Labs
San Francisco, CA (Hybrid)

About The Position

About Black Forest Labs

We're the team behind Latent Diffusion, Stable Diffusion, and FLUX: foundational technologies that changed how the world creates images and video. Our generative models power tools used by millions of creators, developers, and businesses worldwide. Our FLUX models are among the most advanced in the world, and we're just getting started. Headquartered in Freiburg, Germany, with a growing presence in San Francisco, we're scaling fast while staying true to what makes us different: research excellence, open science, and building technology that expands human creativity.

Why This Role

You'll design, deploy, and maintain the ML infrastructure backbone that makes frontier AI research possible. This isn't abstract systems work: every decision you make directly impacts whether a multi-week training run succeeds, whether inference stays fast enough for production, and whether researchers can iterate quickly or wait hours for resources.

Requirements

  • You've built and managed ML infrastructure at scale and understand that supporting AI research is fundamentally different from traditional cloud infrastructure.
  • You've been paged because a training run failed.
  • You've debugged why storage became the bottleneck.
  • You know the difference between infrastructure that works in demos and infrastructure that works when researchers depend on it for months-long experiments.
  • Strong proficiency in cloud platforms (AWS, Azure, or GCP) with focus on ML/AI services—you know which services matter and which are marketing
  • Extensive experience with Kubernetes and Slurm cluster management in production environments
  • Expertise in Infrastructure as Code tools (Terraform, Ansible, etc.) and the discipline to actually use them
  • Proven track record managing and optimizing network-based cloud file systems and object storage for ML workloads
  • Experience with CI/CD tools and practices (CircleCI, GitHub Actions, ArgoCD, etc.) in ML contexts
  • Strong understanding of security principles and best practices in cloud environments—without making security the enemy of velocity
  • Experience with monitoring and observability tools (Prometheus, Grafana, Loki, etc.) that help you understand what's actually happening
  • Familiarity with ML workflows and GPU infrastructure management—you understand what researchers need
  • Demonstrated ability to handle complex migrations and breaking changes in production environments without losing data or breaking experiments

Nice To Haves

  • Experience building custom autoscaling solutions for ML workloads that standard tools can't handle
  • Knowledge of cost-optimization strategies for cloud-based ML infrastructure (because GPU hours add up)
  • Familiarity with MLOps practices and tools
  • Experience with high-performance computing (HPC) environments
  • An understanding of data versioning and experiment tracking for ML
  • Knowledge of network optimization techniques for distributed ML training
  • Experience with multi-cloud or hybrid cloud architectures
  • Familiarity with container security and vulnerability scanning tools

Responsibilities

  • Designs, deploys, and maintains cloud-based ML training clusters (Slurm) and inference clusters (Kubernetes) that researchers and products depend on
  • Implements and manages network-based cloud file systems and blob/S3 storage solutions optimized for ML workloads at scale
  • Develops and maintains Infrastructure as Code (IaC) for resource provisioning—because manual configuration doesn't scale and configuration drift breaks things
  • Implements and optimizes CI/CD pipelines for ML workflows, making it easy for researchers to go from experiment to production
  • Designs and implements custom autoscaling solutions for ML workloads where standard approaches fall short
  • Ensures security best practices across the ML infrastructure stack without creating friction that slows down research
  • Provides developer-friendly tools and practices that make ML operations efficient—because infrastructure that's hard to use doesn't get used

Benefits

  • We'll cover reasonable travel costs.