Software Engineer, Data Infrastructure

Cartesia, San Francisco, CA
Onsite

About The Position

Data is the lifeblood of our models, and we're looking for a Software Engineer to help build the training data and ML data infrastructure at Cartesia. This role sits at the intersection of data systems, model training, and inference — it is not a siloed data org. You'll design and ship the pipelines, datasets, and infrastructure that feed our pre-training and post-training, with particular depth in audio and other multimodal data. Your work will directly shape the capabilities and quality of our foundation models.

This is a hands-on technical role. We're looking for someone fluent at the application and ML infrastructure layer, who ships modern, well-tested code and partners closely with research and inference teams. This is not a traditional data warehousing, analytics, or BI engineering role.

Requirements

  • Hands-on experience with ML data infrastructure: training data pipelines, dataset versioning, large-scale data loading, and the interplay between data systems and model training and inference.
  • Working knowledge of multimodal data, especially audio: formats, preprocessing, augmentation, and large-scale storage and streaming patterns.
  • Strong modern engineering execution: clean, well-tested code, fluency with current tools, and a willingness to pick the right tool for the problem rather than defaulting to familiar patterns.
  • Track record of driving significant technical projects end-to-end in a fast-moving, research-driven environment.
  • Familiarity with building and evaluating datasets for generative models, and reasonable working knowledge of how those models are trained and served at inference time.

Responsibilities

  • Contribute to Cartesia's multimodal data strategy across pre-training and post-training, spanning human, synthetic, and web-scale sources, with particular depth in audio.
  • Design and build scalable, high-throughput data pipelines for text, audio, and video — covering ingestion, preprocessing, augmentation, dataset versioning, and data loading for training.
  • Partner closely with research and inference teams so data systems are co-designed with training and serving infrastructure (batching, GPU-aware loading, evaluation pipelines).
  • Drive rigorous standards for data quality, with a tight feedback loop between dataset characteristics and model behavior.
  • Identify and integrate novel datasets, including working with external data vendors and partners.

Benefits

  • Competitive base salary alongside an attractive equity package.
  • A monthly stipend to help you get to and from the office.
  • Flexible PTO. Take as much time as you need to recharge your batteries.
  • Lunch, dinner, and plenty of snacks provided daily.