Sr. Principal Software Engineer

Cerence

1d•$141,400 - $226,300•Hybrid

About The Position

Cerence AI is the global leader in AI for transportation, specializing in AI and voice-powered companions for cars, two-wheelers, and more that enable people to focus on what matters most. With over 500 million cars shipped with Cerence AI's technology, we partner with leading automakers (such as Volkswagen, Mercedes, Audi, Toyota, and many more), mobility providers, and technology companies to power intuitive, integrated experiences that create safer, more connected, and more enjoyable journeys for drivers and passengers alike.

Our team is dedicated to pushing the boundaries of AI innovation, working around the globe with headquarters in Burlington, Massachusetts, USA, and 16 other offices across Europe, Asia, and North America. We bring together diverse backgrounds and varied skill sets with the shared goal of advancing the next generation of transportation user experiences. Our culture is customer-centric, collaborative, fast-paced, and fun, with continuous opportunities for learning and development to support your career growth. We’re looking for an exceptional Senior Principal AI Scientist in Generative AI who is ready to drive the future of mobility with us!

Requirements

Proven experience optimizing ML inference performance in production
Deep understanding of GPU architecture and memory hierarchies
Hands‑on experience with CUDA and low‑level performance tuning
Experience deploying models beyond research environments
Inference engines: vLLM, TensorRT‑LLM, llama.cpp, QAIRT
CUDA kernel development and profiling
Quantization techniques: INT8/INT4/FP4/FP8, AWQ, GPTQ
KV cache optimization and memory layout design
Latency optimization: batching, speculative decoding, continuous batching

Nice To Haves

Experience deploying models efficiently on edge and embedded devices, not just servers
Track record of tokens/sec throughput that significantly outperforms baseline implementations
Experience delivering minimal, predictable end‑to‑end latency
Experience materially reducing inference cost per request

Responsibilities

Optimize and deploy high‑performance LLM inference pipelines
Own inference runtimes across data center, edge, and embedded platforms
Push model performance through quantization, kernel fusion, and cache optimization
Drive latency and throughput improvements that directly impact production products
Enable efficient, reliable deployment without external vendor dependency
Build deep expertise in and ownership of vLLM, TensorRT‑LLM, llama.cpp, and QAIRT
Extend and tune inference engines using custom CUDA kernels
Adapt runtimes for constrained and embedded deployment environments
Implement and evaluate quantization strategies: INT8, INT4, FP4, FP8, mixed precision, AWQ, and GPTQ
Balance accuracy, latency, memory footprint, and throughput
Optimize key–value cache performance through paging, prefix caching, and cache‑aware memory layout design
Reduce memory pressure while sustaining high throughput
Design and tune batching strategies, continuous batching, and speculative decoding
Optimize tail latency and tokens/sec under real production traffic patterns