Principal LLM Inference Engineer

d-Matrix•Santa Clara, CA

3d•$195,000 - $285,000•Onsite

About The Position

At d-Matrix, we are focused on unleashing the potential of generative AI to power the transformation of technology. We are at the forefront of software and hardware innovation, pushing the boundaries of what is possible. Our culture is one of respect and collaboration. We value humility and believe in direct communication. Our team is inclusive, and our differing perspectives allow for better solutions. We are seeking individuals who are passionate about tackling challenges and driven by execution. Ready to come find your playground? Together, we can help shape the endless possibilities of AI.

The d-Matrix Frontier Group sits at the leading edge of what’s possible with LLM inference on heterogeneous hardware. Our charter spans the full stack: from pathfinding emerging use cases and novel deployment patterns, to deep optimization of inference kernels, to building proof-of-concept systems that showcase d-Matrix’s unique computational fabric. We are an applied research and engineering team that moves fast, ships real systems, and works directly with product and hardware teams to shape the roadmap.

We build the tools, runtimes, and frameworks that let frontier AI models run efficiently and cost-effectively across heterogeneous deployments that combine d-Matrix silicon with CPUs, GPUs, and custom accelerators. Our work powers everything from benchmarking and evaluation pipelines to production-grade inference serving.

This Role

We are hiring end-to-end inference engineers who are comfortable going from a novel research idea to a deployed, optimized system. You will work at every layer of the inference stack, from kernel-level optimization to distributed orchestration to high-level serving APIs.

This role could be a great match for you if you:

• Have deep intuition for modern generative AI architectures and how to squeeze performance out of them at inference time.
• Are familiar with the internals of open-source inference frameworks (vLLM, SGLang, TensorRT-LLM, etc.) and can extend or replace them when needed.
• Enjoy pathfinding new use cases: exploring heterogeneous deployment topologies and building early-stage POCs that prove out new ideas.
• Are results-oriented with a strong bias toward action; you own problems end-to-end from prototype to optimization to handoff.
• Are energized by working at the intersection of novel hardware and frontier models, and want your work to directly influence how next-generation AI silicon is used.
• Value clear communication and thrive in a small, high-ownership team environment.

Requirements

Bachelor’s degree in Computer Science, Electrical Engineering, or a related field, and 10+ years of relevant engineering experience; or equivalent demonstrated experience.
Strong proficiency in Python and C/C++.
Hands-on experience optimizing LLM inference — attention kernels, KV cache, batching strategies, quantization (INT8/FP8/INT4).
Experience with at least one major inference framework (vLLM, SGLang, TensorRT-LLM, ONNX Runtime, or similar) at a contributor level.
Familiarity with GPU kernel programming (CUDA/Triton) and performance profiling tools.

Nice To Haves

Master’s or PhD in Computer Science, Electrical Engineering, or a related field, with 6+ years of relevant industry experience.
Experience with heterogeneous compute deployments — scheduling inference workloads across dissimilar hardware (accelerators, CPUs, GPUs).
Familiarity with custom silicon or ASIC-based inference (beyond GPU-only environments).
Experience with distributed inference: tensor parallelism, pipeline parallelism, disaggregated serving.
Contributions to open-source inference or ML systems projects.
Experience with production inference serving at scale (latency SLOs, continuous batching, multi-model serving).
Familiarity with speculative decoding, mixture-of-experts routing, or long-context serving techniques.
Working familiarity with the material in the JAX Scaling Book or equivalent systems-level understanding of modern LLM training and inference.

Responsibilities

Identify and prototype emerging LLM inference use cases suited to heterogeneous hardware deployments.
Build compelling proof-of-concept systems that demonstrate d-Matrix capabilities to customers, partners, and internal stakeholders.
Develop and tune custom kernels and operator-level optimizations to maximize throughput and minimize latency.
Drive quantization, sparsity, and batching strategies tailored to d-Matrix’s computational model.
Build and maintain inference runtimes, serving frameworks, and evaluation tooling.
Contribute to distributed inference systems: tensor/pipeline parallelism, disaggregated prefill/decode, KV-cache management.
Work closely with hardware architects to provide firmware and compiler teams with actionable inference workload insights.
Partner with product and business development to translate POCs into customer-facing demonstrations.
Contribute to technical publications, whitepapers, and open-source projects that advance d-Matrix’s visibility.