Member of Technical Staff, Inference

Inferact
San Francisco, CA
$200,000 - $400,000 (Hybrid)

About The Position

We're looking for an inference runtime engineer to push the boundaries of what's possible in LLM and diffusion model serving. Models keep growing larger, and architectures keep shifting: mixture-of-experts, multimodal, agentic. Every breakthrough demands innovation in the inference engine itself. You'll work at the core of vLLM, optimizing how models execute across diverse hardware and architectures. Your work will directly impact how the world runs AI inference.

Requirements

  • Bachelor's degree or equivalent experience in computer science, engineering, or similar.
  • Deep understanding of transformer architectures and their variants.
  • Strong programming skills in Python with experience in PyTorch internals.
  • Experience with LLM inference systems (vLLM, TensorRT-LLM, SGLang, TGI).
  • Ability to read and implement model architectures and inference techniques from research papers.
  • Demonstrated ability to contribute performant, maintainable code and to debug complex ML codebases.

Nice To Haves

  • Deep understanding of KV-cache memory management, prefix caching, and hybrid model serving.
  • Familiarity with RL frameworks and algorithms for LLMs.
  • Experience with multimodal inference (audio/image/video/text).
  • Contributions to open-source ML or system infrastructure projects.
  • Implemented core features in vLLM or other inference engine projects.
  • Contributed to vLLM integrations (verl, OpenRLHF, Unsloth, LlamaFactory, etc).
  • Widely-shared technical blog posts or side projects on vLLM or LLM inference.

Benefits

  • Inferact offers generous health, dental, and vision benefits as well as a 401(k) company match.