Tech Lead - AI Inference

WEKA

About The Position

We are seeking a Tech Lead for our AI Inference team. In this role, you will bridge the gap between complex research and production-grade engineering while cultivating a high-performing team culture. You will lead and grow a squad of three developers, balancing hands-on technical contribution with strong people leadership: setting direction, unblocking your team, and driving execution on high-performance systems that optimize Large Language Model (LLM) serving. The ideal candidate combines deep technical expertise in inference at scale with the leadership maturity to mentor, motivate, and develop engineers in the evolving ecosystem of serving frameworks such as vLLM and LMCache.

Requirements

Experienced Engineering Leader: 5+ years of professional software engineering experience, with a proven record of leading engineers and owning complex production systems, ideally in AI/ML infrastructure or high-performance computing.
Deep AI Inference Background: Hands-on expertise with LLM serving systems — KV cache reuse, disaggregated prefill/decode, continuous batching, and multi-tier GPU memory hierarchies (HBM → NVMe). Strong familiarity with vLLM, LMCache, NIXL/NVIDIA Dynamo, or similar frameworks.
Systems Engineering Depth: Strong Python and C++ skills (Rust a plus), with a solid grasp of CUDA, GPU memory management, and high-performance I/O — including GPUDirect Storage (GDS), RDMA, and NVMe data paths.
Infrastructure Fluency: Experience deploying and scaling GPU workloads on Kubernetes, with familiarity with RDMA networking, bare-metal GPU clusters (H100/A100), and high-throughput distributed storage.
People Leadership: Demonstrated ability to mentor and develop engineers — running effective 1:1s, supporting career growth, and balancing technical execution with long-term team health.
High Bar for Quality: A strong sense of engineering craftsmanship, with a track record of building reliable, high-throughput systems and continuously improving engineering practices.

Nice To Haves

Rust experience

Responsibilities

Lead & Own: Take end-to-end ownership of the core inference infrastructure of AMG (WEKA's Augmented Memory Grid), from the NVMe Token Warehouse and GDS data paths to the vLLM/LMCache serving stack, driving technical decisions and delivery outcomes.
Technical Direction: Guide a team of engineers through design, implementation, and delivery of high-throughput, low-latency LLM inference systems, setting high standards for code quality, architecture, and reliability.
Build at Scale: Stay hands-on across the AMG stack (Python, C++, CUDA, vLLM, NIXL/Dynamo, Kubernetes), contributing directly to production systems while providing technical leadership to the team.
Solve Hard Problems: Tackle the real frontier challenges of inference engineering — disaggregated prefill/decode, persistent off-HBM KV caching, RDMA-based transport, and multi-tier GPU memory hierarchies — that define what's possible at scale.
Grow People & Teams: Mentor engineers through regular 1:1s, career coaching, and sprint reviews. Foster a culture of ownership, collaboration, and technical excellence within the AMG team.
Stay on the Frontier: Track the evolving inference ecosystem, benchmark new tools (SGLang, TRT-LLM, NVIDIA Dynamo), and help the team make timely decisions about when to adopt, build, or pivot.