Member of Technical Staff - Research Engineer

Black Forest Labs
San Francisco, CA
Hybrid

About The Position

Black Forest Labs is the team behind Latent Diffusion, Stable Diffusion, and FLUX, foundational technologies that have transformed image and video creation. The company develops generative models used by millions globally, its FLUX models are state-of-the-art, and it is expanding rapidly. With headquarters in Freiburg, Germany, and a presence in San Francisco, Black Forest Labs emphasizes research excellence, open science, and fostering human creativity.

This role is crucial for translating research ideas into reality through large-scale training, which means addressing complex systems and performance challenges on GPU clusters. The position involves working closely with researchers and producing the code, measurements, kernels, debugging tools, and training system changes that enable advanced research. The company is open to various seniority levels and seeks individuals with deep technical ownership who can navigate ambiguous problems, verify results, and take responsibility for outcomes.

Requirements

  • Experience working deeply on large-scale training systems, ideally as part of a training group working closely with researchers
  • Strong PyTorch fluency, including comfort reading and modifying low-level training code rather than only using high-level APIs
  • Experience with distributed training concepts such as FSDP, tensor/model/context/sequence parallelism, activation checkpointing, NCCL, and overlapping compute and communication
  • Hands-on experience improving training throughput, memory footprint, or stability in real training runs
  • Experience profiling GPU workloads with tools like Nsight Systems, Nsight Compute, torch profiler, trace viewers, or custom telemetry
  • Practical GPU performance judgment: you may use modern coding agents and tools as much as you want, but you need the understanding to verify correctness, numerical behavior, and performance, and to own the result
  • Understanding of low-precision training and quantization tradeoffs: FP8, MXFP8, FP4/NVFP4-style formats, scaling, accumulation, numerical validation, and convergence risk
  • Good research judgment: you can partner with researchers on ablations, understand what the measurements do and do not prove, and keep optimization work tied to model-quality outcomes
  • Comfortable operating in ambiguity: sometimes the task is a clean implementation, sometimes it is a production fire, and sometimes it is figuring out which of three plausible explanations is actually true

Nice To Haves

  • Supported or co-owned training for a frontier foundation model that shipped or reached a major release
  • Written or substantially improved forward/backward GPU kernels, or have shown you can make progress on kernel-level work with strong measurement and validation discipline
  • Worked on attention performance, variable-sequence-length training, or non-standard attention patterns
  • Experience on Hopper or Blackwell-class GPUs
  • Worked on low-precision training
  • Experience with diffusion, flow matching, DiT, and multimodal generative model training; if your deepest background is autoregressive or LLM training systems, you are excited to learn the diffusion and multimodal modeling stack quickly
  • Can move naturally between profiler traces, kernel code, distributed systems failures, and research discussions

Responsibilities

  • Improve the performance, reliability, and numerical stability of production training runs for large multimodal generative models
  • Profile full training steps across model code, attention, kernels, data loading, encoders, communication, optimizer steps, checkpointing, and memory pressure
  • Implement and validate GPU-level optimizations: fused kernels, attention paths, low-precision matmuls, quantization kernels, CUDA/Triton/CuTe/CUTLASS experiments, and no-compile alternatives where they make sense
  • Push lower-precision training forward, including FP8 / MXFP8 / FP4-style paths, weight and activation quantization, accumulation choices, convergence risk, and quality tradeoffs against baseline training runs
  • Work with researchers to translate architecture changes into efficient training implementations, and help distinguish real model-quality progress from changes that only look good in a microbenchmark
  • Debug distributed training failures: NaNs, loss spikes, silent numerical drift, memory leaks, stragglers, bad nodes, NCCL issues, and throughput cliffs
  • Build benchmarking and profiling harnesses that make performance claims trustworthy across hardware, shapes, sequence lengths, and training configurations
  • Help the training team move quickly when an urgent bottleneck appears, while turning repeated failures into better abstractions and tools

Benefits

  • Equity
  • Travel costs covered for in-person weeks