Machine Learning Engineer, Inference & Serving (Speech LLM) - San Francisco

Plaud · San Francisco, CA
$180,000 - $270,000 · Hybrid

About The Position

Plaud builds the world's most trusted AI work companion: note-taking solutions that help professionals elevate productivity and performance, loved by over 1,500,000 users worldwide since 2023. With a mission to amplify human intelligence, Plaud is building next-generation intelligence infrastructure and interfaces to capture, extract, and utilize what you say, hear, see, and think. Plaud Inc. is a Delaware-incorporated, San Francisco-based company pushing the boundary of human–AI intelligence through a hardware–software combination. With SOC 2, HIPAA, GDPR, ISO 27001, ISO 27701, and EN 18031 compliance, Plaud is committed to the highest standards of data security and privacy protection.

Requirements

  • Hands-on experience building and deploying high-throughput, ultra-low-latency inference engines for large language models or foundational speech models.
  • Understanding of the tradeoffs among latency, throughput, and Time-To-First-Token (or Time-To-First-Audio) in real-time streaming environments (a TTFT measurement sketch follows this list).
  • Practical experience with continuous batching, KV cache management (e.g., PagedAttention), and stateful connections necessary for real-time conversational AI.
  • Deep understanding of GPU architectures (NVIDIA Ampere/Hopper) and the memory hierarchy, allowing you to identify and eliminate hardware bottlenecks.
  • Ability to communicate clearly and collaborate effectively, as you will sit at the critical intersection between the core ML training team and the backend infrastructure team.
  • Comfort in fast-moving environments and genuine enjoyment of the systems-engineering challenge of squeezing every last drop of performance out of a cluster of GPUs.
  • An obsession with building AI systems that natively understand and generate speech, ultimately creating a hardware-software AI companion that amplifies human productivity.
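
Concretely, the latency tradeoffs called out above are usually tracked as Time-To-First-Token. Below is a minimal measurement sketch, assuming an OpenAI-compatible streaming server (such as one launched with vLLM) running locally; the URL, model name, and prompt are placeholder assumptions:

```python
# Minimal sketch: measure Time-To-First-Token (TTFT) against a streaming
# OpenAI-compatible completions endpoint (e.g., served by vLLM).
# The URL, model name, and prompt below are illustrative assumptions.
import time

import requests

URL = "http://localhost:8000/v1/completions"  # assumed local vLLM server
payload = {
    "model": "my-speech-llm",  # hypothetical model name
    "prompt": "Summarize this meeting in one sentence:",
    "max_tokens": 128,
    "stream": True,
}

start = time.perf_counter()
ttft = None
with requests.post(URL, json=payload, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        # Server-sent events arrive as lines of the form "data: {...}".
        if not line or not line.startswith(b"data: "):
            continue
        if line == b"data: [DONE]":
            break
        if ttft is None:
            # First streamed chunk: in real-time speech products this
            # latency dominates perceived responsiveness.
            ttft = time.perf_counter() - start
total = time.perf_counter() - start
print(f"TTFT: {ttft:.3f}s, total generation: {total:.3f}s")
```

In a streaming deployment, TTFT and total throughput pull in opposite directions: larger batches raise throughput but delay the first token, which is exactly the tradeoff this role optimizes.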

Nice To Haves

  • Deep, under-the-hood familiarity with modern LLM serving frameworks like vLLM, TensorRT-LLM, SGLang, or NVIDIA Triton Inference Server (bonus points for active open-source contributions to these repositories).
  • Experience handling continuous audio streams over WebSockets or WebRTC, deploying neural audio codecs, and managing chunked audio generation to minimize conversational latency (a streaming sketch follows this list).
  • Experience implementing cutting-edge generation algorithms such as speculative decoding, lookahead decoding, or chunked prefill.
  • Hands-on experience with post-training quantization (PTQ): deploying models in FP8, INT8, AWQ, or GPTQ without degrading audio naturalness or ASR accuracy.
  • Experience deploying multi-GPU (Tensor Parallelism) and multi-node inference pipelines, and managing autoscaling infrastructure using Kubernetes.
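
As a concrete companion to the streaming bullets above, here is a hedged client-side sketch of chunked audio exchange over a WebSocket. The endpoint, the empty-frame end-of-utterance sentinel, and the 20 ms chunk size are illustrative assumptions, not an actual Plaud interface:

```python
# Sketch: stream raw PCM audio to a speech-LLM gateway over a WebSocket and
# consume chunked audio replies as they arrive. Endpoint, framing, and chunk
# size are assumptions for illustration only.
import asyncio

import websockets  # pip install websockets

WS_URL = "ws://localhost:8080/stream"  # hypothetical gateway endpoint
CHUNK_BYTES = 640  # 20 ms of 16 kHz, 16-bit, mono PCM

async def stream_audio(pcm: bytes) -> None:
    async with websockets.connect(WS_URL) as ws:
        # Send audio in small fixed-size chunks so the server can start
        # decoding before the utterance ends (lower Time-To-First-Audio).
        for i in range(0, len(pcm), CHUNK_BYTES):
            await ws.send(pcm[i:i + CHUNK_BYTES])
        await ws.send(b"")  # assumed end-of-utterance sentinel

        # Consume generated audio chunks as the model produces them.
        async for reply in ws:
            if not reply:  # assumed end-of-response sentinel
                break
            handle_audio_chunk(reply)

def handle_audio_chunk(chunk: bytes) -> None:
    # Stand-in for a real playback or codec-decode path.
    print(f"received {len(chunk)} bytes of generated audio")

asyncio.run(stream_audio(b"\x00" * 32000))  # 1 s of silence as demo input
```

Chunked send and receive like this is what keeps conversational latency low: neither side waits for a complete utterance before doing work.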

Responsibilities

  • Building and deploying high-throughput, ultra-low-latency inference engines for large language models or foundational speech models.
  • Understanding the intricate tradeoffs between latency, throughput, and Time-To-First-Token (or Time-To-First-Audio) in real-time streaming environments.
  • Implementing continuous batching, KV cache management (e.g., PagedAttention), and stateful connections necessary for real-time conversational AI.
  • Identifying and eliminating hardware bottlenecks by deeply understanding GPU architectures (NVIDIA Ampere/Hopper) and their memory hierarchy.
  • Communicating clearly and collaborating effectively with the core ML training team and the backend infrastructure team.
  • Squeezing every last drop of performance out of a cluster of GPUs in fast-moving environments.
  • Building AI systems that natively understand and generate speech, creating a hardware-software AI companion that amplifies human productivity.
  • Deploying modern LLM serving frameworks like vLLM, TensorRT-LLM, SGLang, or NVIDIA Triton Inference Server.
  • Handling continuous audio streams over WebSockets or WebRTC, deploying neural audio codecs, and managing chunked audio generation to minimize conversational latency.
  • Implementing cutting-edge generation algorithms such as speculative decoding, lookahead decoding, or chunked prefill.
  • Deploying models in FP8, INT8, AWQ, or GPTQ without degrading audio naturalness or ASR accuracy (see the quantized-serving sketch after this list).
  • Deploying multi-GPU (Tensor Parallelism) and multi-node inference pipelines.
  • Managing autoscaling infrastructure using Kubernetes.
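
As one concrete illustration of the quantized, multi-GPU serving work listed above, here is a minimal vLLM offline-inference sketch. The checkpoint is an example public AWQ model rather than a Plaud model, and two GPUs are assumed for tensor parallelism:

```python
# Sketch: serve a post-training-quantized (AWQ) checkpoint with vLLM,
# sharded across two GPUs via tensor parallelism. The model ID is an
# example public checkpoint, not a Plaud model.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # example AWQ checkpoint
    quantization="awq",       # load 4-bit AWQ weights
    tensor_parallel_size=2,   # shard weights across 2 GPUs
)
params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Summarize: 'book a meeting at 3pm tomorrow'"], params)
print(outputs[0].outputs[0].text)
```

The same serving path would then be validated against the audio-naturalness and ASR-accuracy checks the bullets above call for, since those are the regressions quantization must not introduce.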

Benefits

  • Opportunity to be an early, foundational member of our core SpeechLLM lab, with meaningful ownership and impact on a fast-growing startup.
  • $180K - $270K base salary + performance bonus + equity.
  • Top-tier healthcare for employees and dependents, including dental and vision, and a generous employer subsidy.
  • 401(k) plan for full-time employees with company matching.
  • Unlimited PTO, plus 13 paid holidays.
  • 12 weeks of paid parental leave to spend time with your new family, regardless of gender.
  • Choice of top-of-the-line laptops/workstations, annual offsites, and a fully stocked office.