AI Data Ops Lead

Sanas • Palo Alto, CA

60d

About The Position

We're looking for a hands-on AI Data Ops Lead to own the datasets that power our speech and language models, along with the analytics built on them. You'll design and maintain data pipelines, labeling workflows, and dashboards that transform raw multimodal data into actionable insights. This role blends data engineering with analytical depth, making it ideal for someone who can write production-grade Python, evaluate dataset quality, and surface trends that shape model development. You'll collaborate with and support Scientists, Data Collection teams, Executives, and external vendors to bring new data sources online, run data collection and labeling, automate data ingestion, and deliver transparent reporting across the AI data lifecycle.

Requirements

3-6 years of experience in data science, data operations, or ML data workflows.
Strong programming skills in Python and SQL (pandas, NumPy, FastAPI, or similar).
Proven experience building and maintaining data dashboards (Gradio, Streamlit, Plotly, Dash, Power BI, or similar).
Strong data analysis and visualization skills; comfort working with large, complex datasets.
Familiarity with databases and cloud data infrastructure (SQL, DynamoDB, AWS Glue, S3, BigQuery, etc.).
Excellent communication and documentation skills; ability to thrive in a fast-moving AI environment.

Nice To Haves

Experience with speech or audio datasets (e.g., ASR, TTS, voice embeddings, or diarization).
Familiarity with data labeling workflows for audio or text.
Knowledge of signal processing, spectrogram analysis, or acoustic feature extraction.
Experience with data orchestration tools (Dagster, Airflow, etc.).
Experience building custom tooling on an as-needed basis (Retool, Replit, etc.).
Exposure to dataset versioning, evaluation pipelines, and MLOps principles.
Interest in advancing the data foundations of AI research.

Responsibilities

Build and maintain internal tools for data collection, labeling, and ingestion.
Discover new data sources and consolidate them into unified data frames for downstream consumption.
Coordinate with multiple stakeholders to ensure timely delivery of high-quality data.
Operate and design ETL data pipelines for large-scale audio, text, and metadata.
Own data quality: build tooling for quality assurance across all dimensions, find and fix inaccuracies, and feed what you learn back into the QA tooling.
Analyze dataset coverage, diversity, and quality; monitor bias and data drift.
Create dashboards and visual reports tracking data distribution, collection throughput, and collection quality.
Work cross-functionally to ensure that the data being made available meets our continuously evolving needs.
Run a monthly newsletter reporting changes to the data and newly available data sources.
Design validation experiments for labeled datasets.
Implement automated checks for consistency, completeness, and noise reduction.
Support research teams with well-documented, high-integrity datasets.