Inference Performance Engineer

Material Group • New York, NY

About The Position

Serving frontier models at scale requires solving novel systems problems at every layer of the stack. As an Inference Performance Engineer, you'll own the runtime that turns accelerators into a production serving system, optimizing throughput, latency, and cost across thousands of nodes. You'll work alongside hardware and compiler teams operating at the frontier of AI silicon design.

Requirements

BS in CS, EE, or related field, or equivalent experience
Software engineering experience: Rust, Go, Python, or C++
Understanding of concurrency, memory, and tail latency
Understanding of modern inference: transformers, attention, KV cache, batching, speculative decoding, quantization
Experience with model serving frameworks: vLLM, TGI, SGLang, TensorRT-LLM, llama.cpp, or custom runtimes
GPU or ASIC programming experience: CUDA, ROCm, Triton, or vendor-native toolchains
Experience with low-precision inference (FP8, FP4, INT4)
Profiling and benchmarking experience: Nsight, perf, custom harnesses

Responsibilities

Build and improve the inference runtime
Design scheduling, continuous batching, KV cache management, and prefill/decode disaggregation
Implement low-precision kernels and speculative decoding
Drive improvements in throughput, latency, and cost per token
Collaborate with hardware teams on kernels, operators, and graph optimizations
Own the OpenAI-compatible API surface and serving protocol
Build benchmarking, profiling, and regression infrastructure