About The Position

WEKA is architecting a new approach to the enterprise data stack, built for the age of reasoning. NeuralMesh by WEKA sets the standard for agentic AI data infrastructure with a cloud- and AI-native software solution that can be deployed anywhere. It transforms legacy data silos into data pipelines that dramatically increase GPU utilization and make AI model training and inference, machine learning, and other compute-intensive workloads run faster, work more efficiently, and consume less energy.

WEKA is a pre-IPO, growth-stage company on a hyper-growth trajectory. We’ve raised $375M in capital from dozens of world-class venture capital and strategic investors. We help the world’s largest and most innovative enterprises and research organizations, including 12 of the Fortune 50, achieve discoveries, insights, and business outcomes faster and more sustainably. We’re passionate about solving our customers’ most complex data challenges to accelerate intelligent innovation and business value. If you share our passion, we invite you to join us on this exciting journey.

We are seeking a Director of Engineering - AI Inference to spearhead our AI Inference team. In this role, you will bridge the gap between complex research and production-grade engineering. You will lead a tight-knit squad of three developers while remaining "hands-on-keyboard," architecting high-performance systems that optimize Large Language Model (LLM) serving. The ideal candidate is deeply invested in inference at scale and in the evolving ecosystem of serving frameworks like vLLM and LMCache.

Requirements

  • AI Inference Domain: Proven experience with KV cache reuse, speculative decoding, and continuous batching (see the sketch after this list).
  • Specific Stack: Deep familiarity with vLLM, LMCache, and NIXL, plus an understanding of the trade-offs between centralized and distributed caching.
  • Backend Engineering: Expertise in Python, C++, or Rust, with a strong grasp of CUDA and GPU memory management.
  • Infrastructure: Experience with Kubernetes (K8s) for scaling GPU workloads and optimizing cold-start times.
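
To make the caching bullet concrete, here is a minimal sketch of KV cache reuse through vLLM's prefix-caching option. The model name and prompts are placeholders, and the enable_prefix_caching argument's availability can vary across vLLM versions, so treat this as an illustration rather than a reference implementation:

    # Minimal sketch: KV cache reuse via vLLM prefix caching.
    # The model name and prompts are placeholders; the flag's
    # availability may vary by vLLM version.
    from vllm import LLM, SamplingParams

    # A system preamble shared by every request: with prefix caching
    # enabled, its KV blocks are computed once and reused across
    # requests instead of being recomputed per request.
    SYSTEM = "You are a concise assistant. Answer in one sentence.\n\n"

    llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)
    params = SamplingParams(temperature=0.0, max_tokens=32)

    prompts = [SYSTEM + q for q in (
        "What is continuous batching?",
        "What is speculative decoding?",
    )]

    # vLLM schedules these requests with continuous batching
    # internally; the shared prefix is served from cache.
    for out in llm.generate(prompts, params):
        print(out.outputs[0].text.strip())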

Responsibilities

  • Technical Leadership: Architect and oversee the deployment of high-throughput, low-latency LLM inference pipelines.
  • Team Management: Mentor and lead a small team of developers, conducting code reviews, sprint planning, and technical career coaching.
  • Inference Optimization: Implement and evaluate state-of-the-art KV cache management solutions, including LMCache, and explore alternatives to minimize redundant computation.
  • Framework Mastery: Deeply integrate and optimize serving-stack components such as vLLM, llm-d, and NIXL to maximize hardware utilization.
  • R&D: Stay at the forefront of the "Inference-as-a-Service" domain, benchmarking new tools and deciding when to pivot the stack (a benchmarking sketch follows this list).
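
For the R&D responsibility, a benchmarking harness is the usual starting point. The sketch below is a hypothetical, engine-agnostic example: generate_fn stands in for whatever serving stack is under evaluation, and the stub engine exists only so the script runs standalone:

    # Minimal sketch: a throughput/latency harness for comparing
    # serving stacks. generate_fn is a hypothetical stand-in for a
    # real engine client (vLLM, llm-d, etc.).
    import time
    from statistics import mean
    from typing import Callable, List

    def benchmark(generate_fn: Callable[[str], str], prompts: List[str]) -> None:
        latencies: List[float] = []
        start = time.perf_counter()
        for prompt in prompts:
            t0 = time.perf_counter()
            generate_fn(prompt)
            latencies.append(time.perf_counter() - t0)
        total = time.perf_counter() - start

        latencies.sort()
        p95 = latencies[max(0, int(0.95 * len(latencies)) - 1)]
        print(f"requests/s:   {len(prompts) / total:.2f}")
        print(f"mean latency: {mean(latencies) * 1000:.2f} ms")
        print(f"p95 latency:  {p95 * 1000:.2f} ms")

    if __name__ == "__main__":
        # Stub "engine" so the sketch runs standalone; swap in a real
        # client call when benchmarking an actual stack.
        benchmark(lambda p: p[::-1], ["hello world"] * 100)

Note that sequential calls like this measure per-request latency; evaluating continuous-batching throughput would issue requests concurrently instead.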

Benefits

  • Medical
  • Dental
  • Vision
  • Life
  • 401(K)
  • Flexible Time off (FTO)
  • Sick time
  • Leave of absence per the FMLA and other relevant leave laws