Machine Learning Engineer, Inference & Serving (Speech LLM) - San Francisco

Plaud · San Francisco, CA
$180,000 - $270,000 · Hybrid

About The Position

Plaud builds the world's most trusted AI work companion: note-taking solutions that help professionals elevate productivity and performance, loved by over 1,500,000 users worldwide since 2023. With a mission to amplify human intelligence, Plaud is building next-generation intelligence infrastructure and interfaces to capture, extract, and utilize what you say, hear, see, and think. Plaud Inc. is a Delaware-incorporated, San Francisco-based company pushing the boundary of human–AI intelligence through a hardware–software combination. With SOC 2, HIPAA, GDPR, ISO 27001, ISO 27701, and EN 18031 compliance, Plaud is committed to the highest standards of data security and privacy protection.

Requirements

  • Hands-on experience building and deploying high-throughput, ultra-low-latency inference engines for large language models or foundational speech models.
  • Understanding of the tradeoffs among latency, throughput, and Time-To-First-Token (or Time-To-First-Audio) in real-time streaming environments (a TTFT measurement sketch follows this list).
  • Practical experience with continuous batching, KV cache management (e.g., PagedAttention), and stateful connections necessary for real-time conversational AI.
  • Deep understanding of GPU architectures (NVIDIA Ampere/Hopper) and the memory hierarchy, allowing you to identify and eliminate hardware bottlenecks.
  • Ability to communicate clearly and collaborate effectively, as you will sit at the critical intersection between the core ML training team and the backend infrastructure team.
  • Comfort in fast-moving environments and genuine enjoyment of the systems-engineering challenge of squeezing every last drop of performance out of a cluster of GPUs.
  • An obsession with building AI systems that natively understand and generate speech, ultimately creating a hardware-software AI companion that amplifies human productivity.
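
Concretely, the latency tradeoffs called out above are usually tracked as Time-To-First-Token. Below is a minimal measurement sketch, assuming an OpenAI-compatible streaming server (such as one launched with vLLM) running locally; the URL, model name, and prompt are placeholder assumptions:

```python
# Minimal sketch: measure Time-To-First-Token (TTFT) against a streaming
# OpenAI-compatible completions endpoint (e.g., served by vLLM).
# The URL, model name, and prompt below are illustrative assumptions.
import time

import requests

URL = "http://localhost:8000/v1/completions"  # assumed local vLLM server
payload = {
    "model": "my-speech-llm",  # hypothetical model name
    "prompt": "Summarize this meeting in one sentence:",
    "max_tokens": 128,
    "stream": True,
}

start = time.perf_counter()
ttft = None
with requests.post(URL, json=payload, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        # Server-sent events arrive as lines of the form "data: {...}".
        if not line or not line.startswith(b"data: "):
            continue
        if line == b"data: [DONE]":
            break
        if ttft is None:
            # First streamed chunk: in real-time speech products this
            # latency dominates perceived responsiveness.
            ttft = time.perf_counter() - start
total = time.perf_counter() - start
print(f"TTFT: {ttft:.3f}s, total generation: {total:.3f}s")
```

In a streaming deployment, TTFT and total throughput pull in opposite directions: larger batches raise throughput but delay the first token, which is exactly the tradeoff this role optimizes.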

Nice To Haves

  • Deep, under-the-hood familiarity with modern LLM serving frameworks like vLLM, TensorRT-LLM, SGLang, or NVIDIA Triton Inference Server (bonus points for active open-source contributions to these repositories).
  • Experience handling continuous audio streams over WebSockets or WebRTC, deploying neural audio codecs, and managing chunked audio generation to minimize conversational latency (a streaming sketch follows this list).
  • Experience implementing cutting-edge generation algorithms such as speculative decoding, lookahead decoding, or chunked prefill.
  • Hands-on experience with post-training quantization (PTQ): deploying models in FP8, INT8, AWQ, or GPTQ without degrading audio naturalness or ASR accuracy.
  • Experience deploying multi-GPU (Tensor Parallelism) and multi-node inference pipelines, and managing autoscaling infrastructure using Kubernetes.
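
As a concrete companion to the streaming bullets above, here is a hedged client-side sketch of chunked audio exchange over a WebSocket. The endpoint, the empty-frame end-of-utterance sentinel, and the 20 ms chunk size are illustrative assumptions, not an actual Plaud interface:

```python
# Sketch: stream raw PCM audio to a speech-LLM gateway over a WebSocket and
# consume chunked audio replies as they arrive. Endpoint, framing, and chunk
# size are assumptions for illustration only.
import asyncio

import websockets  # pip install websockets

WS_URL = "ws://localhost:8080/stream"  # hypothetical gateway endpoint
CHUNK_BYTES = 640  # 20 ms of 16 kHz, 16-bit, mono PCM

async def stream_audio(pcm: bytes) -> None:
    async with websockets.connect(WS_URL) as ws:
        # Send audio in small fixed-size chunks so the server can start
        # decoding before the utterance ends (lower Time-To-First-Audio).
        for i in range(0, len(pcm), CHUNK_BYTES):
            await ws.send(pcm[i:i + CHUNK_BYTES])
        await ws.send(b"")  # assumed end-of-utterance sentinel

        # Consume generated audio chunks as the model produces them.
        async for reply in ws:
            if not reply:  # assumed end-of-response sentinel
                break
            handle_audio_chunk(reply)

def handle_audio_chunk(chunk: bytes) -> None:
    # Stand-in for a real playback or codec-decode path.
    print(f"received {len(chunk)} bytes of generated audio")

asyncio.run(stream_audio(b"\x00" * 32000))  # 1 s of silence as demo input
```

Chunked send and receive like this is what keeps conversational latency low: neither side waits for a complete utterance before doing work.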

Responsibilities

  • Building and deploying high-throughput, ultra-low-latency inference engines for large language models or foundational speech models.
  • Understanding the intricate tradeoffs between latency, throughput, and Time-To-First-Token (or Time-To-First-Audio) in real-time streaming environments.
  • Implementing continuous batching, KV cache management (e.g., PagedAttention), and stateful connections necessary for real-time conversational AI.
  • Identifying and eliminating hardware bottlenecks by deeply understanding GPU architectures (NVIDIA Ampere/Hopper) and their memory hierarchy.
  • Communicating clearly and collaborating effectively with the core ML training team and the backend infrastructure team.
  • Squeezing every last drop of performance out of a cluster of GPUs in fast-moving environments.
  • Building AI systems that natively understand and generate speech, creating a hardware-software AI companion that amplifies human productivity.
  • Deploying modern LLM serving frameworks like vLLM, TensorRT-LLM, SGLang, or NVIDIA Triton Inference Server.
  • Handling continuous audio streams over WebSockets or WebRTC, deploying neural audio codecs, and managing chunked audio generation to minimize conversational latency.
  • Implementing cutting-edge generation algorithms such as speculative decoding, lookahead decoding, or chunked prefill.
  • Deploying models in FP8, INT8, AWQ, or GPTQ without degrading audio naturalness or ASR accuracy (see the quantized-serving sketch after this list).
  • Deploying multi-GPU (Tensor Parallelism) and multi-node inference pipelines.
  • Managing autoscaling infrastructure using Kubernetes.
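
As one concrete illustration of the quantized, multi-GPU serving work listed above, here is a minimal vLLM offline-inference sketch. The checkpoint is an example public AWQ model rather than a Plaud model, and two GPUs are assumed for tensor parallelism:

```python
# Sketch: serve a post-training-quantized (AWQ) checkpoint with vLLM,
# sharded across two GPUs via tensor parallelism. The model ID is an
# example public checkpoint, not a Plaud model.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # example AWQ checkpoint
    quantization="awq",       # load 4-bit AWQ weights
    tensor_parallel_size=2,   # shard weights across 2 GPUs
)
params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Summarize: 'book a meeting at 3pm tomorrow'"], params)
print(outputs[0].outputs[0].text)
```

The same serving path would then be validated against the audio-naturalness and ASR-accuracy checks the bullets above call for, since those are the regressions quantization must not introduce.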

Benefits

  • Opportunity to be an early, foundational member of our core SpeechLLM lab, with meaningful ownership and impact on a fast-growing startup.
  • $180K - $270K base salary + performance bonus + equity.
  • Top-tier healthcare for employees and dependents, including dental and vision, and a generous employer subsidy.
  • 401(k) plan for full-time employees with company matching.
  • Unlimited PTO, plus 13 paid holidays.
  • 12 weeks of paid parental leave to spend time with your new family, regardless of gender.
  • Choice of top-of-the-line laptops/workstations, annual offsites, and a fully stocked office.