Member of Technical Staff - Research Engineer

Black Forest Labs • San Francisco, CA

2d • Hybrid

About The Position

Black Forest Labs is the team behind Latent Diffusion, Stable Diffusion, and FLUX, foundational technologies that have transformed image and video creation. The company develops generative models used by millions globally; its FLUX models are state-of-the-art, and it is expanding rapidly. Headquartered in Freiburg, Germany, with a presence in San Francisco, Black Forest Labs emphasizes research excellence, open science, and fostering human creativity.

This role turns research ideas into reality through large-scale training, addressing complex systems and performance challenges on GPU clusters. You will work closely with researchers, producing code, measurements, kernels, debugging tools, and training system changes that enable advanced research. The company is open to various seniority levels and seeks individuals with deep technical ownership who can navigate ambiguous problems, verify results, and take responsibility for outcomes.

Requirements

Experience working deeply on large-scale training systems, ideally as part of a training group working closely with researchers
Strong PyTorch fluency, including comfort reading and modifying low-level training code rather than only using high-level APIs
Experience with distributed training concepts such as FSDP, tensor/model/context/sequence parallelism, activation checkpointing, NCCL, and overlapping compute and communication
Hands-on experience improving training throughput, memory footprint, or stability in real training runs
Experience profiling GPU workloads with tools like Nsight Systems, Nsight Compute, torch profiler, trace viewers, or custom telemetry
Practical GPU performance judgment: you may use modern coding agents and tools as much as you want, but you need the understanding to verify correctness, numerical behavior, and performance, and to own the result
Understanding of low-precision training and quantization tradeoffs: FP8, MXFP8, FP4/NVFP4-style formats, scaling, accumulation, numerical validation, and convergence risk
Good research judgment: you can partner with researchers on ablations, understand what the measurements do and do not prove, and keep optimization work tied to model-quality outcomes
Comfortable operating in ambiguity: sometimes the task is a clean implementation, sometimes it is a production fire, and sometimes it is figuring out which of three plausible explanations is actually true

Nice To Haves

Supported or co-owned training for a frontier foundation model that shipped or reached a major release
Wrote or substantially improved forward/backward GPU kernels, or showed you can make progress on kernel-level work with strong measurement and validation discipline
Worked on attention performance, variable-sequence-length training, or non-standard attention patterns
Experience on Hopper or Blackwell-class GPUs
Worked on low-precision training
Experience with diffusion, flow matching, DiT, and multimodal generative model training; if your deepest background is autoregressive or LLM training systems, you are excited to learn the diffusion and multimodal modeling stack quickly
Can move naturally between profiler traces, kernel code, distributed systems failures, and research discussions

Responsibilities

Improve the performance, reliability, and numerical stability of production training runs for large multimodal generative models
Profile full training steps across model code, attention, kernels, data loading, encoders, communication, optimizer steps, checkpointing, and memory pressure
Implement and validate GPU-level optimizations: fused kernels, attention paths, low-precision matmuls, quantization kernels, CUDA/Triton/CuTe/CUTLASS experiments, and no-compile alternatives where they make sense
Push lower-precision training forward, including FP8 / MXFP8 / FP4-style paths, weight and activation quantization, accumulation choices, convergence risk, and quality tradeoffs against baseline training runs
Work with researchers to translate architecture changes into efficient training implementations, and help distinguish real model-quality progress from changes that only look good in a microbenchmark
Debug distributed training failures: NaNs, loss spikes, silent numerical drift, memory leaks, stragglers, bad nodes, NCCL issues, and throughput cliffs
Build benchmarking and profiling harnesses that make performance claims trustworthy across hardware, shapes, sequence lengths, and training configurations
Help the training team move quickly when an urgent bottleneck appears, while turning repeated failures into better abstractions and tools