Senior Software Engineer - AI Inference

NVIDIA, Santa Clara, CA

About The Position

NVIDIA is the platform upon which every new AI-powered application is built. We are seeking a Senior Software Engineer - AI Inference to advance open-source LLM serving by contributing directly to upstream inference engines such as vLLM and SGLang, ensuring they run best in class on NVIDIA GPUs and systems, and by improving the underlying stack that enables high-throughput, low-latency inference at scale. This is a hands-on role for an engineer who enjoys digging into performance bottlenecks, designing pragmatic runtime improvements, and shipping high-quality changes that are broadly useful to the community and to production deployments.

Requirements

  • 5+ years building production software with solid systems engineering fundamentals and a track record of delivering performance or reliability improvements.
  • Experience with LLM inference/serving stacks (e.g., vLLM, SGLang) and an understanding of the tradeoffs that drive real production performance.
  • Strong programming skills in Python plus C++ and/or CUDA; ability to debug and optimize performance‑critical code.
  • Experience with profiling and performance investigation (microbenchmarks, flame graphs, GPU profiling) and a measurement-driven mindset; a minimal timing sketch follows this list.
  • Familiarity with distributed systems concepts and concurrency (queues/schedulers, multi‑process/multi‑threading, scaling across GPUs/nodes).
  • Strong communication skills and comfort working with open‑source communities (issues, PR discussions, code review).
  • BS/MS in Computer Science, Computer Engineering, or related field (or equivalent experience).
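
To illustrate the measurement-driven mindset called out above, here is a minimal GPU microbenchmark sketch in Python. It assumes PyTorch on a CUDA-capable machine; the matmul workload, shapes, and iteration counts are placeholders, not a real serving workload.

    import torch

    def time_gpu(fn, warmup=10, iters=100):
        # Warm up so one-time costs (allocator, JIT) don't skew the numbers.
        for _ in range(warmup):
            fn()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            fn()
        end.record()
        torch.cuda.synchronize()  # wait for all queued GPU work before reading timers
        return start.elapsed_time(end) / iters  # mean latency in milliseconds

    a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
    b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
    print(f"fp16 matmul: {time_gpu(lambda: a @ b):.3f} ms/iter")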

Nice To Haves

  • Open-source contributions to vLLM, SGLang, PyTorch, Triton, NCCL, Dynamo, or adjacent serving/runtime projects.
  • Shipped performance work such as improved attention/KV cache efficiency, speculative decoding, scheduler improvements, quantization-aware serving, or streaming latency reductions.
  • Experience building reproducible benchmarking and performance regression infrastructure for latency/throughput; a toy harness is sketched after this list.
  • Systems performance background spanning memory bandwidth, kernel fusion, PCIe/NVLink effects, and network fabrics (e.g., InfiniBand).
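
As a sketch of what such regression infrastructure might look like, the Python harness below measures median and tail latency plus throughput for a generic generate(prompt) callable and fails when p99 exceeds a budget. The entry point and thresholds are assumptions for illustration, not vLLM or SGLang APIs.

    import statistics
    import time

    def check_regression(generate, prompts, p99_budget_ms):
        # Time each request individually and the run as a whole.
        latencies_ms = []
        t0 = time.perf_counter()
        for prompt in prompts:
            start = time.perf_counter()
            generate(prompt)
            latencies_ms.append((time.perf_counter() - start) * 1000.0)
        wall_s = time.perf_counter() - t0

        latencies_ms.sort()
        idx = min(len(latencies_ms) - 1, int(len(latencies_ms) * 0.99))
        p50, p99 = statistics.median(latencies_ms), latencies_ms[idx]
        print(f"p50={p50:.1f} ms  p99={p99:.1f} ms  "
              f"throughput={len(prompts) / wall_s:.1f} req/s")
        # Gate CI on tail latency so slowdowns surface before they ship.
        if p99 > p99_budget_ms:
            raise AssertionError(f"p99 regression: {p99:.1f} ms > {p99_budget_ms} ms")

    # Smoke test with a stub model that just sleeps ~10 ms per request.
    check_regression(lambda p: time.sleep(0.01), ["hello"] * 200, p99_budget_ms=15.0)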

Responsibilities

  • Contribute features, fixes, and optimizations upstream to vLLM/SGLang: author PRs, participate in reviews, write benchmarks/tests, and help drive designs to completion.
  • Implement and optimize inference-runtime capabilities: batching and scheduling policies, streaming, request lifecycle management, and KV-cache efficiency (paging/sharding) to improve throughput and tail latency (see the paging sketch after this list).
  • Profile and improve hot paths across layers, from Python orchestration to C++/CUDA kernels, using data to guide optimization work.
  • Improve multi‑GPU inference performance and reliability: parallelism strategies, communication patterns, and resource utilization across NVIDIA platforms.
  • Build and maintain performance and correctness regression tests to prevent slowdowns and ensure stable behavior across model and hardware configurations.
  • Collaborate with model, platform, and SRE teams to translate production requirements into upstreamable solutions with strong operability and maintainability.
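
As one example of the KV-cache efficiency work above, the toy sketch below shows the bookkeeping behind paged KV caching in the spirit of vLLM's PagedAttention: a per-request block table maps token positions to fixed-size physical blocks, so memory is committed in small increments instead of one contiguous reservation per request. All names and sizes are assumptions for illustration, not vLLM's actual interfaces.

    class PagedKVCache:
        # Toy block-table bookkeeping only; a real cache also stores the
        # K/V tensors and coordinates with the scheduler on preemption.
        def __init__(self, num_blocks, block_size):
            self.block_size = block_size
            self.free_blocks = list(range(num_blocks))  # pool of physical block ids
            self.block_tables = {}  # request_id -> list of physical block ids
            self.seq_lens = {}      # request_id -> tokens cached so far

        def append_token(self, request_id):
            # Allocate a new physical block only when the request crosses
            # a block boundary, so memory grows in block_size increments.
            n = self.seq_lens.get(request_id, 0)
            if n % self.block_size == 0:
                if not self.free_blocks:
                    raise MemoryError("cache full: preempt or swap a request")
                self.block_tables.setdefault(request_id, []).append(self.free_blocks.pop())
            self.seq_lens[request_id] = n + 1

        def free(self, request_id):
            # Returning blocks to the pool lets the scheduler admit new work.
            self.free_blocks.extend(self.block_tables.pop(request_id, []))
            self.seq_lens.pop(request_id, None)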

Benefits

  • You will be eligible for equity and benefits.