About The Position

We are looking for a Senior Software Engineer for Quantized Inference! NVIDIA is seeking software engineers to accelerate the discovery and deployment of efficient inference recipes for LLMs. A recipe defines which operators are transformed into low-precision or sparsified variants, unlocking throughput and latency gains without regressing accuracy or output verbosity. Recipes may incorporate techniques such as rotations, block scaling to attenuate outlier impact, or improved calibration data drawn from SFT/RL pipelines. Each new recipe demands corresponding kernel- and model-level implementations in inference engines (vLLM, TRT-LLM, SGLang). You will translate recipe specifications into functionally correct, performant code: writing Triton kernels, inserting quantize/dequantize nodes into prefill and decode paths, and ensuring per-expert scaling in MoE layers is handled correctly. From there, you will collaborate with partner inference teams to further optimize throughput and interactivity on target workloads. This work is a core component of our productization effort across Megatron-LM, ModelOpt, and vLLM.
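To give a flavor of the quantize/dequantize work described above, here is a minimal, illustrative sketch of symmetric per-tensor int8 quantization with max-abs calibration. This is not NVIDIA's recipe or any engine's actual API; the function names and the pure-Python implementation are hypothetical, chosen only to show the basic round trip a quantize/dequantize node pair performs.

```python
def calibrate_scale(values):
    """Max-abs calibration: pick the scale so the largest magnitude maps to 127."""
    return max(abs(v) for v in values) / 127


def quantize_int8(values, scale):
    """Symmetric per-tensor quantization: map floats to integer codes in [-127, 127]."""
    q = []
    for v in values:
        n = round(v / scale)
        q.append(max(-127, min(127, n)))  # clamp to the int8 symmetric range
    return q


def dequantize_int8(q_values, scale):
    """Map integer codes back to floats; the round trip has error bounded by scale/2."""
    return [q * scale for q in q_values]
```

In a real engine these steps run as fused kernels on tensors, with per-channel, per-block, or per-expert scales rather than a single per-tensor scale, but the numerics of the round trip are the same.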

Requirements

  • Proficient in Python; familiarity with C++
  • Strong software engineering fundamentals: concise, well-tested code; fluent with AI-assisted tooling
  • Experience with ML accelerators and a basic understanding of how specific ML layers affect execution time
  • Familiarity with PyTorch internals (custom ops, autograd, export) or equivalent framework
  • Experience reading, modifying, or contributing to a large open-source codebase
  • MS/PhD in Computer Science or a related field, or equivalent experience
  • 4+ years in a relevant software engineering role
  • Demonstrated ability to move fast with ambiguous requirements, with strong written and verbal communication

Nice To Haves

  • Experience contributing to inference serving frameworks (vLLM, TRT-LLM, SGLang) or Triton kernel development
  • Track record of debugging numerical issues across mixed-precision boundaries
  • Deep experience with model compression techniques: PTQ, QAT, structured/unstructured sparsity

Responsibilities

  • Implement quantized and sparse recipes in inference engines (vLLM, TRT-LLM, SGLang)
  • Own model export pipelines (ModelOpt, Megatron-LM <-> HuggingFace), ensuring quantized checkpoints serialize correctly for downstream serving
  • Build prototypes and benchmarking harnesses to evaluate recipe throughput/interactivity before full optimization
  • Develop data analysis tooling and visualizations for numerics debugging
  • Improve developer productivity across the team: CI, build systems, training infrastructure, pipeline friction
  • Participate in code reviews and incorporate feedback

Benefits

  • You will be eligible for equity and benefits