Machine Learning Engineer, Core Data

Cantina • San Francisco, CA

About The Position

Cantina Labs is a social AI company developing advanced real-time models for expression, personality, and realism that bring characters to life. It aims to transform how people tell stories, connect, and create by building and powering ecosystems. The company is seeking an ML Engineer specializing in data quality to own the datasets that power its speech systems. The role is hands-on with audio and text data: auditing, denoising, filtering, and labeling it, and building the tools and models that turn raw data into reliable training corpora for Text-to-Speech (TTS) and related tasks. The engineer will define data quality metrics and classifiers, run human-in-the-loop annotation programs, and implement quality gates in training and evaluation pipelines. The goal is to directly improve model performance, robustness, and cost-efficiency by owning the data side of the model-data-evaluation cycle.

Requirements

Strong experience building ML-driven data quality systems for audio/speech, or equivalent data-centric ML experience with a track record of improving model outcomes via better data.
Proficiency in Python and PyTorch, including training/fine-tuning SSL/ASR models (Whisper, Wav2Vec, BERT) and CNN-based classifiers, and writing robust production code.
Audio/speech fundamentals: torchaudio/librosa/ffmpeg, spectrogram features (e.g., log-mel, MFCC), VAD/SAD, basic DSP, and audio QA.
Scalable data engineering skills: Spark/Beam or similar, SQL, Airflow or equivalent orchestration, and cloud storage/computing (AWS/GCP).
Familiarity with ASR/TTS metrics and tooling: WER, MOS/MOSNet, PESQ/STOI/ViSQOL, speaker verification (EER), diarization, language ID.
Experience with dataset validation, versioning, and experiment tracking; comfort debugging data issues from single samples to fleet-wide trends.
Ability to balance rigor with speed, and to translate ambiguous requirements into measurable data improvements.

Nice To Haves

Shipped datasets and/or data quality tooling that moved the needle for TTS/ASR/VC in production.
Built and deployed classifiers for LID, SV/diarization, VAD, noise/glitch detection, or safety/content moderation for audio.
Ran crowdsourcing/vendor annotation at scale with strong quality control (honeypots, IAA, label aggregation).
Background in denoising/enhancement and its effects on downstream TTS quality.
Contributions to open-source or publications in speech/audio/ML.
Experience with data governance, consent tracking, and policy enforcement.

Responsibilities

Dataset ownership: define specs; audit and curate large-scale audio/text; close corpus gaps and fix sample-level issues.
Quality instrumentation: build automated gates/metrics (e.g., SNR, clipping, VAD, WER, SV/LID, safety) with dashboards; validate against listening tests.
Classifiers and filters: train lightweight models to tag, score, and filter data (VAD, ASR gating, LID, SV/diarization, noise/safety); calibrate to subjective outcomes.
Cleaning and integrity: apply denoise/dereverb/de-clip when beneficial; deduplicate and decontaminate; prevent leakage; maintain lineage and versioned releases.
Data selection: optimize mixtures via sampling, weighting, curriculum, and active learning; mine hard negatives and long-tail cases.
Tooling and pipelines: ship reproducible ETL and validation; integrate quality gates into training/eval; add monitoring and alerts.
Human-in-the-loop and compliance: run MTurk/vendor annotation with strong QC; ensure consent/licensing/policy compliance; collaborate across teams and document datasets.
