Member of Technical Staff - Research Engineer

Black Forest Labs•San Francisco, CA

$180,000 - $290,000•Hybrid

About The Position

Black Forest Labs is the team behind foundational generative model technologies like Latent Diffusion, Stable Diffusion, and FLUX. The company builds advanced generative models used by millions worldwide. It is headquartered in Freiburg, Germany, with a growing presence in San Francisco, and emphasizes research excellence, open science, and expanding human creativity.

This role is crucial for translating research ideas into reality through large-scale training. It addresses complex systems and performance challenges in areas such as attention performance, custom kernels, low-precision training, profiling, memory behavior, data movement, and distributed training stability. The position requires deep technical ownership and the ability to make progress on ambiguous training-system problems, verify results, and own outcomes. The company is open to a range of seniority levels for this role.

Requirements

Experience working deeply on large-scale training systems, ideally as part of a training group working closely with researchers.
Strong PyTorch fluency, including comfort reading and modifying low-level training code.
Experience with distributed training concepts such as FSDP, tensor/model/context/sequence parallelism, activation checkpointing, NCCL, and overlapping compute and communication.
Hands-on experience improving training throughput, memory footprint, or stability in real training runs.
Experience profiling GPU workloads with tools like Nsight Systems, Nsight Compute, torch profiler, trace viewers, or custom telemetry.
Practical GPU performance judgment: knowing how to verify correctness, numerical behavior, and performance.
Understanding of low-precision training and quantization tradeoffs: FP8, MXFP8, FP4/NVFP4-style formats, scaling, accumulation, numerical validation, and convergence risk.
Good research judgment: ability to partner with researchers on ablations, understand measurement limitations, and tie optimization work to model-quality outcomes.
Comfortable operating in ambiguity: ability to handle clean implementations, production issues, and investigative tasks.
Ability to move naturally between profiler traces, kernel code, distributed systems failures, and research discussions.

Nice To Haves

Supported or co-owned training for a frontier foundation model that shipped or reached a major release.
Written or substantially improved forward/backward GPU kernels, or demonstrated progress on kernel-level work with strong measurement and validation discipline.
Worked on attention performance, variable sequence length training, or non-standard attention patterns.
Experience on Hopper or Blackwell-class GPUs.
Experience with diffusion, flow matching, DiT, and multimodal generative model training, or, if your background is in autoregressive or LLM training systems, excitement to learn the diffusion and multimodal modeling stack quickly.

Responsibilities

Improve the performance, reliability, and numerical stability of production training runs for large multimodal generative models.
Profile full training steps across model code, attention, kernels, data loading, encoders, communication, optimizer steps, checkpointing, and memory pressure.
Implement and validate GPU-level optimizations: fused kernels, attention paths, low-precision matmuls, quantization kernels, CUDA/Triton/CuTe/CUTLASS experiments, and no-compile alternatives.
Advance lower-precision training, including FP8 / MXFP8 / FP4-style paths, weight and activation quantization, accumulation choices, convergence risk, and quality tradeoffs.
Translate architecture changes into efficient training implementations and distinguish real model-quality progress from microbenchmark improvements.
Debug distributed training failures such as NaNs, loss spikes, numerical drift, memory leaks, stragglers, bad nodes, NCCL issues, and throughput cliffs.
Build benchmarking and profiling harnesses for trustworthy performance claims across hardware, shapes, sequence lengths, and training configurations.
Address urgent bottlenecks and transform repeated failures into improved abstractions and tools for the training team.