Data Engineer, Machine Learning

Sesame
San Francisco, CA

About The Position

Sesame is seeking a Data Engineer to construct and manage the data pipelines essential for their AI models. This role involves close collaboration with machine learning engineers and researchers, ensuring they have timely access to accurate and appropriately formatted data for model training, evaluation, and deployment. The data at Sesame is diverse, encompassing conversations, voice recordings, sensor signals, and product telemetry. The engineer will be responsible for designing systems that transform raw, unstructured, multimodal data into clean, versioned, and well-documented datasets that ML teams can rely on. This is a highly technical, infrastructure-centric position, closely aligned with ML engineering rather than traditional data analytics. The Data Engineer will be integrated with ML teams to understand their workflows and develop infrastructure that accelerates the entire model development lifecycle, from data collection and labeling to training and evaluation.

Requirements

  • 5+ years in data engineering, with meaningful experience supporting ML or AI teams specifically
  • Strong SQL and Python skills — you'll use both daily
  • Experience building and operating ETL/ELT pipelines at scale using modern data platforms and tooling
  • Experience with workflow orchestration systems such as Airflow, Dagster, or Prefect
  • Hands-on experience with ML data workflows: training data pipelines, dataset versioning, data labeling pipelines, or model evaluation data
  • A solid understanding of how ML teams work — you don't need to train models; what matters is understanding what makes a good training dataset and why data quality directly affects model performance
  • Comfort working with unstructured and semi-structured data — audio, text, JSON logs — not just clean relational tables
  • Strong communication skills. You'll be embedded with ML engineers and need to bridge data systems and model requirements effectively

Nice To Haves

  • Vector databases, embedding storage, or feature stores
  • Data from hardware or embedded systems: telemetry, sensors, real-time streams
  • Distributed compute frameworks for large-scale data processing such as Ray or Spark
  • Kubernetes and managed Kubernetes environments such as GKE or EKS
  • Data privacy frameworks, especially around voice or conversational data
  • Building internal tooling or self-serve data platforms

Responsibilities

  • Design and build production data pipelines that prepare conversational, voice, and multimodal data for model training and evaluation
  • Partner directly with ML engineers to understand data requirements for new models and experiments, and deliver datasets that meet those needs
  • Build and maintain infrastructure for dataset versioning, lineage tracking, and reproducibility — so any training run can be traced back to its exact data
  • Develop data quality frameworks that catch issues before they become model quality issues: schema validation, drift detection, and coverage monitoring
  • Optimize large-scale data processing for cost and performance across Sesame's cloud infrastructure
  • Build tooling that makes it easy for ML engineers and researchers to discover, explore, and request data independently
  • Define and enforce data governance and privacy standards, particularly around sensitive conversational and voice data
  • Contribute to architecture decisions around Sesame's broader data platform as the team and data volume grow

Benefits

  • 401(k) with employer match up to 3.5% of compensation
  • 100% employer-paid health, vision, and dental benefits for you and your dependents
  • Unlimited PTO and sick time
  • Medical flexible spending account (FSA) with employer matching up to $1,650/year
  • Guardian Employee Assistance Program (EAP)
  • Opportunity to share in the company's success with competitive stock options