About The Position

Plaud is building the world's most trusted AI work companion for professionals, elevating productivity and performance through note-taking solutions loved by over 1,500,000 users worldwide since 2023. With a mission to amplify human intelligence, Plaud develops next-generation intelligence infrastructure and interfaces to capture, extract, and utilize what you say, hear, see, and think. Plaud Inc. is a Delaware-incorporated, San Francisco-based company pushing the boundary of human–AI intelligence through a hardware–software combination. With SOC 2, HIPAA, GDPR, ISO 27001, ISO 27701, and EN 18031 compliance, Plaud is committed to the highest standards of data security and privacy protection.

Requirements

  • Proven track record of building and training large-scale audio or speech models from the ground up, whether that involves unified SpeechLLMs, advanced ASR, expressive TTS, or generative audio architectures.
  • A love of living at the intersection of research and engineering, equally eager to design novel sequence modeling architectures one day and debug distributed training clusters the next.
  • Deep comfort traversing the entire stack, from fundamental signal processing and raw acoustic representations to massive foundation model training and edge-device optimization.
  • Deep expertise in PyTorch or JAX, with battle scars from optimizing large-scale distributed training runs, managing GPU memory utilization, and resolving complex performance bottlenecks.
  • Ability to thrive in a fast-paced, high-growth startup environment where you are expected to take extreme ownership of ambiguous problems and drive them directly into production.
  • An obsession with building AI systems that natively understand and generate speech, ultimately creating a hardware-software AI companion that amplifies human productivity.

Nice To Haves

  • Text-based LLMs: Hands-on experience with core text-based Large Language Model pretraining, instruction tuning, or RLHF.
  • Neural Audio Codecs: Hands-on experience designing and training state-of-the-art neural audio codecs for streamable, high-fidelity audio.
  • Generative Architectures: Designing and training diffusion models, flow matching, or autoregressive architectures specifically for speech and voice generation.
  • Alignment & Steerability: Applying Reinforcement Learning (RL) techniques (like RLHF or GRPO) to improve conversational cadence, steerability, and alignment in foundation models.
  • Deep System Optimization: End-to-end inference and performance optimization, leveraging high-throughput serving frameworks (e.g., vLLM, TensorRT-LLM, SGLang) to minimize latency for real-time cloud streaming.
  • Large-Scale Infrastructure: Managing massive GPU clusters, utilizing advanced distributed training frameworks (e.g., FSDP, DeepSpeed), and navigating orchestration tools like Kubernetes.

Responsibilities

  • Building and training large-scale audio or speech models from the ground up, such as unified SpeechLLMs, advanced ASR, expressive TTS, and generative audio architectures.
  • Designing novel sequence modeling architectures.
  • Debugging distributed training clusters.
  • Traversing the entire stack from fundamental signal processing and raw acoustic representations to massive foundation model training and edge-device optimization.
  • Optimizing large-scale distributed training runs.
  • Managing GPU memory utilization.
  • Resolving complex performance bottlenecks.
  • Taking extreme ownership of ambiguous problems and driving them directly into production.
  • Building AI systems that natively understand and generate speech, ultimately creating a hardware-software AI companion that amplifies human productivity.
  • Contributing to core text-based Large Language Model pretraining, instruction tuning, or RLHF.
  • Designing and training state-of-the-art neural audio codecs for streamable, high-fidelity audio.
  • Designing and training diffusion models, flow matching, or autoregressive architectures specifically for speech and voice generation.
  • Applying Reinforcement Learning (RL) techniques (like RLHF or GRPO) to improve conversational cadence, steerability, and alignment in foundation models.
  • Optimizing end-to-end inference and serving performance, leveraging high-throughput serving frameworks (e.g., vLLM, TensorRT-LLM, SGLang) to minimize latency for real-time cloud streaming.
  • Managing massive GPU clusters, utilizing advanced distributed training frameworks (e.g., FSDP, DeepSpeed), and navigating orchestration tools like Kubernetes.

Benefits

  • Opportunity to be an early, foundational member of our core SpeechLLM lab, with meaningful ownership and impact on a fast-growing startup.
  • $180K–$270K base salary + performance bonus + equity.
  • Top-tier healthcare for employees and dependents, including dental and vision, with a generous employer subsidy.
  • 401(k) plan for full-time employees with company matching.
  • Unlimited PTO, plus 13 paid holidays.
  • 12 weeks of paid parental leave to spend time with your new family, regardless of gender.
  • Choice of top-of-the-line laptops/workstations, annual offsites, and a fully stocked office.