About The Position

Your mission is to build and operate the ingestion systems that turn the open web and large-scale audio sources into reliable, well-structured corpora for training Sanas's frontier speech models. You'll own the machinery that acquires, extracts, filters, versions, and delivers audio data to our training pipelines — and you'll work directly with our research scientists to close the loop between what we collect and how it improves model quality.

Requirements

  • 4+ years of experience in data engineering, ML data infrastructure, or backend systems engineering — with direct experience building large-scale data ingestion or crawling systems.
  • Strong Python and systems engineering skills — you build robust, maintainable infrastructure, not just one-off scripts.
  • Hands-on experience with distributed systems design: you've built systems that handle failure gracefully, scale horizontally, and recover cleanly.
  • Experience with web crawling infrastructure at scale, including rate limiting, deduplication, and content extraction.
  • Proficiency with cloud platforms (AWS or GCP), object storage (S3/GCS), and container orchestration (Kubernetes).
  • Comfort working with audio processing tooling — ffmpeg, librosa, torchaudio, sox — and experience handling large volumes of audio files.
  • Strong data quality instincts: you instrument pipelines, surface issues proactively, and treat data correctness with the same rigor as software correctness.
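To give a flavor of the crawling skills above, here is a minimal sketch of per-host rate limiting plus content-hash deduplication. All names and the one-request-per-host-per-interval policy are illustrative assumptions, not Sanas's actual stack.

```python
import hashlib
import time
from urllib.parse import urlparse

class PoliteFetcher:
    """Illustrative sketch: per-host rate limiting and payload deduplication."""

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval       # seconds between hits to one host
        self.last_hit: dict[str, float] = {}   # host -> last request timestamp
        self.seen_hashes: set[str] = set()     # content hashes already ingested

    def wait_turn(self, url: str) -> None:
        """Sleep just long enough to respect the per-host interval."""
        host = urlparse(url).netloc
        elapsed = time.monotonic() - self.last_hit.get(host, 0.0)
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_hit[host] = time.monotonic()

    def is_duplicate(self, payload: bytes) -> bool:
        """True if an identical payload was already ingested (exact-hash dedup)."""
        digest = hashlib.sha256(payload).hexdigest()
        if digest in self.seen_hashes:
            return True
        self.seen_hashes.add(digest)
        return False
```

A production crawler would add robots.txt checks, retry/backoff, and near-duplicate detection, but the core bookkeeping looks like this.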

Nice To Haves

  • Experience building speech or audio datasets for ASR, TTS, speech enhancement, or speaker verification model training.
  • Familiarity with major open speech corpora — Common Voice, LibriSpeech, VoxPopuli, AISHELL — and their sourcing and quality characteristics.
  • Experience with data versioning tools.
  • Background in multilingual or low-resource language data collection.
  • Experience with annotation and labeling platforms.
  • Familiarity with speaker diarization, language identification, or automated audio quality estimation models used for data filtering at scale.

Responsibilities

  • Own and lead engineering projects across the full data acquisition stack — web crawling, audio ingestion, source discovery, and dataset delivery to training pipelines.
  • Build and operate large-scale distributed crawling infrastructure capable of continuously discovering and ingesting audio at scale across languages, accents, domains, and recording environments.
  • Develop specialized crawlers for high-priority audio sources with source-specific extraction and normalization logic.
  • Run experiments to evaluate crawling strategies, extraction methods, and ingestion tradeoffs; analyze results to identify gaps, redundancy, and coverage improvements across speaker demographics and language pairs.
  • Build ingestion pipelines that scale reliably across large data campaigns, with automated audio quality filtering — SNR estimation, clipping detection, codec artifact identification — as a first-class pipeline stage.
  • Design and deploy highly scalable distributed systems capable of handling petabytes of audio data — from raw acquisition through quality filtering, deduplication, segmentation, and versioned dataset generation.
  • Architect and implement indexing and search capabilities over large audio corpora — enabling fast lookup by language, speaker, acoustic condition, duration, and quality tier.
  • Build and maintain backend services for data storage, including key-value databases, metadata synchronization, and manifest management across dataset versions.
  • Deploy and operate acquisition infrastructure in a Kubernetes / Infrastructure-as-Code environment; perform routine system health checks and respond to production issues quickly.
  • Collaborate closely with data processing, architecture, and ML platform teams to ensure smooth data flow from acquisition through to training-ready outputs.
  • Work closely with legal to handle compliance, data privacy, and licensing matters across all acquisition sources — maintaining a clear audit trail of provenance, permitted use, and commercial training rights for every dataset.
  • Enforce speaker consent documentation, GDPR requirements, robots.txt and ToS adherence, and audio retention policies across all ingestion pipelines.
  • Manage relationships with third-party data vendors — writing precise acquisition briefs, evaluating quality on delivery, and ensuring sourced data meets Sanas's licensing and quality standards.
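As a sketch of the automated quality filtering mentioned above, the snippet below shows a toy clipping detector and a crude energy-ratio SNR estimate. These are simplified heuristics for illustration only (the thresholds and the quietest-frames-as-noise assumption are ours), not Sanas's filtering pipeline.

```python
import numpy as np

def clipping_ratio(samples: np.ndarray, threshold: float = 0.99) -> float:
    """Fraction of samples at or above the clipping threshold (absolute value)."""
    return float(np.mean(np.abs(samples) >= threshold))

def estimate_snr_db(samples: np.ndarray, frame_len: int = 2048) -> float:
    """Crude SNR estimate: ratio of the loudest to the quietest frame energies,
    in dB. Treats the quietest 10% of frames as 'noise' - a rough heuristic,
    not a real voice-activity detector."""
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    energies = np.sort(np.mean(frames ** 2, axis=1))
    k = max(1, n_frames // 10)
    noise = np.mean(energies[:k])       # quietest frames
    signal = np.mean(energies[-k:])     # loudest frames
    return 10.0 * np.log10(signal / max(noise, 1e-12))
```

In a real pipeline these checks would run as a first-class ingestion stage, tagging each clip with quality metrics before it reaches training manifests.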

What This Job Offers

Job Type

Full-time

Career Level

Mid Level

Education Level

No Education Listed

Number of Employees

1-10 employees

© 2024 Teal Labs, Inc