Machine Learning Engineer, Model Evaluations (Speech LLM) - San Francisco

Plaud | San Francisco, CA
$180,000 - $270,000 | Hybrid

About The Position

Plaud is building the world's most trusted AI work companion for professionals to elevate productivity and performance through note-taking solutions, loved by over 1,500,000 users worldwide since 2023. With a mission to amplify human intelligence, Plaud is building the next-generation intelligence infrastructure and interfaces to capture, extract, and utilize what you say, hear, see, and think. Plaud Inc. is a Delaware-incorporated, San Francisco-based company pushing the boundary of human–AI intelligence through a hardware–software combination. With SOC 2, HIPAA, GDPR, ISO27001, ISO27701, and EN18031 compliance, Plaud is committed to the highest standards of data security and privacy protection.

Requirements

  • Strong software engineering skills (especially in Python)
  • Experience building reliable distributed systems, data pipelines, or evaluation harnesses that can run at scale against live model checkpoints.
  • Ability to partner with ML researchers to define measurable benchmarks for Speech LLMs.
  • Experience building and owning dashboards that track model health during training.
  • Ability to rapidly debug anomalous mid-training results.
  • Ability to communicate complex statistical results and model behaviors clearly to both technical and non-technical stakeholders.

Nice To Haves

  • Deep familiarity with both traditional audio metrics (WER, CER, PESQ, etc.) and modern evaluation frameworks (e.g., automated MOS scoring).
  • Experience using frontier models or finetuning multi-modal LLMs to evaluate the conversational logic, transcription accuracy, audio quality, and reasoning of audio models.
  • Experience managing large-scale crowdsourcing operations or preference data collection to support RLHF/DPO efforts.
  • A strong background in statistics and experimental design, paired with experience building trusted tracking dashboards (e.g., Weights & Biases, MLflow).
  • Experience curating complex datasets to test edge cases, such as heavy accents, overlapping speech, or highly noisy acoustic environments.
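To make the "traditional metrics" above concrete: word error rate (WER) is the word-level edit distance between a reference transcript and an ASR hypothesis, normalized by the reference length. A minimal sketch (illustrative only, not Plaud's internal tooling):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167
```

Character error rate (CER) is the same computation over characters instead of words; PESQ and MOS scoring require perceptual models rather than edit distance.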

Responsibilities

  • Turn ambiguous, subjective concepts like a voice's naturalness, expressiveness, or conversational cadence into clear, defensible, and automated metrics that researchers and leadership can rely on.
  • Build reliable distributed systems, data pipelines, and evaluation harnesses (primarily in Python) that run at scale against live model checkpoints.
  • Partner deeply with ML researchers to define exactly what "good" looks like for a Speech LLM, translating capabilities (like ASR robustness in noisy environments or TTS emotional steerability) into measurable benchmarks.
  • Build and own dashboards that track model health during training, improving signal-to-noise ratios, reducing evaluation latency, and making performance regressions impossible to miss.
  • Rapidly debug anomalous mid-training results to determine whether a drop in performance stems from the model architecture, corrupted data, or infrastructure.
  • Communicate complex statistical results and model behaviors clearly to both technical and non-technical stakeholders.

Benefits

  • Competitive Compensation: $180K - $270K base salary + performance bonus + equity.
  • Comprehensive Benefits: Top-tier healthcare for employees and dependents, including dental and vision, and a generous employer subsidy.
  • Retirement Planning: 401(k) plan for full-time employees with company matching.
  • Paid Time Off: Unlimited PTO, plus 13 paid holidays.
  • New Parent Leave: 12 weeks of paid time off to spend time with your new family, regardless of gender.
  • Hybrid Office: Minimum of 3x in-office per week to foster highly collaborative, fast-paced research.
  • Gear & Perks: Choice of top-of-the-line laptops/workstations, annual offsites, and a fully stocked office.