AI Engineer - Model Performance

Fathom•San Francisco, CA

9d•Remote

About The Position

We're hiring a Model Performance Engineer to own the speed, cost, and reliability of our model inference stack, and to build the fine-tuning infrastructure that makes the rest of the AI team faster. This is not a research role. You'll be optimizing real systems serving millions of meetings: choosing between quantization trade-offs, debugging speculative decoding, or figuring out why one GPU family's tail latency explodes at high concurrency while another stays stable.

You'll own two things:

1. Inference performance. You'll make our models faster and cheaper: speculative decoding, quantization, serving configuration, GPU selection, batching strategies, cold start mitigation, adapter swapping. Our traffic is extremely spiky (meetings end in 30-minute blocks), so you need to think about throughput curves. Our team greatly values offering a fast product.

2. Fine-tuning pipelines. The AI team constantly fine-tunes models for new tasks: distilling large teacher models for classification, training adapters for domain-specific behavior, DPO for preference tuning. Right now each project reinvents the training loop. You'll build repeatable infrastructure so an AI Engineer can go more quickly from dataset to deployed model.

Requirements

Deep experience with LLM serving frameworks (vLLM, SGLang, TensorRT-LLM, or similar) — not just deploying them, but tuning them: attention backends, scheduling strategies, CUDA graph warmup, prefix caching
Hands-on quantization experience — you've gone beyond "apply FP8 and hope." You understand weight vs activation quantization, per-channel vs per-tensor scaling, and when dynamic quantization introduces more overhead than it saves
Production fine-tuning experience — LoRA/QLoRA SFT, familiarity with training frameworks (ms-swift, Axolotl, torchtune, or similar), understanding of data formatting, learning rate schedules, and how to diagnose training failures
Strong Python. You'll write serving infrastructure, benchmarking harnesses, and training pipelines — not notebooks
Comfort with GPU profiling and performance analysis. You should be able to look at a benchmark result and know whether the bottleneck is compute, memory bandwidth, or scheduling overhead
Cost modeling for GPU infrastructure — you've had to choose between GPU types and justify the tradeoff
Experience with multimodal models (audio/vision encoders + LLM decoders)
Experience with Modal, Ray Serve, or similar serverless GPU platforms
Understanding of audio processing (codecs, chunking, sample rates)
Experience building internal tooling that other engineers use — this role succeeds when the rest of the team ships faster

Not Required

ML research background or publications
Prompt engineering expertise (we have a team for that)
Frontend or full-stack experience
Masters/PhD (though it's fine if you have one)

Responsibilities

Benchmark FP8 quantization across GPU families, find that FP8 KV cache causes catastrophic repetition loops, identify static quantization as 6% faster than dynamic on certain hardware, and ship a production config that gets 1.3x speedup with <1% quality degradation
Evaluate serving frameworks (vLLM vs SGLang) with speculative decoding — discover that ngram speculation degrades ASR quality while EAGLE3 draft models don't, and that torch.compile makes certain GPUs 7% slower
Build a fine-tuning pipeline that takes a JSONL dataset and produces an optimized fine-tuned model ready for serving, so a teammate can train a small classifier in an afternoon instead of a week
Optimize GPU spend — know which GPU families are best for batch workloads (stable under high concurrency) vs latency-sensitive paths (40% faster, but tail latency blows up under load), and when a 30% cost premium isn't worth it
Debug production inference issues — trace a quality regression to a serving framework upgrade that changed the default attention backend, or find that audio format handling in the multimodal pipeline silently drops segments