Software Engineer, Data Infrastructure

Cartesia, San Francisco, CA
Onsite

About The Position

Data is the lifeblood of our models, and we're looking for a Software Engineer to help build the training data and ML data infrastructure at Cartesia. This role sits at the intersection of data systems, model training, and inference — it is not a siloed data org. You'll design and ship the pipelines, datasets, and infrastructure that feed our pre-training and post-training, with particular depth in audio and other multimodal data. Your work will directly shape the capabilities and quality of our foundation models.

This is a hands-on technical role. We're looking for someone fluent at the application and ML infrastructure layer, who ships modern, well-tested code and partners closely with research and inference teams. This is not a traditional data warehousing, analytics, or BI engineering role.

Requirements

  • Hands-on experience with ML data infrastructure: training data pipelines, dataset versioning, large-scale data loading, and the interplay between data systems and model training and inference.
  • Working knowledge of multimodal data, especially audio: formats, preprocessing, augmentation, and large-scale storage and streaming patterns.
  • Strong modern engineering execution: clean, well-tested code, fluency with current tools, and a willingness to pick the right tool for the problem rather than defaulting to familiar patterns.
  • Track record of driving significant technical projects end-to-end in a fast-moving, research-driven environment.
  • Familiarity with building and evaluating datasets for generative models, and reasonable working knowledge of how those models are trained and served at inference time.

Responsibilities

  • Contribute to Cartesia's multimodal data strategy across pre-training and post-training, spanning human, synthetic, and web-scale sources, with particular depth in audio.
  • Design and build scalable, high-throughput data pipelines for text, audio, and video — covering ingestion, preprocessing, augmentation, dataset versioning, and data loading for training.
  • Partner closely with research and inference teams so data systems are co-designed with training and serving infrastructure (batching, GPU-aware loading, evaluation pipelines).
  • Drive rigorous standards for data quality, with a tight feedback loop between dataset characteristics and model behavior.
  • Identify and integrate novel datasets, including working with external data vendors and partners.

Benefits

  • Competitive base salary alongside an attractive equity package.
  • A monthly stipend to help you get to and from the office.
  • Flexible PTO. Take as much time as you need to recharge your batteries.
  • Lunch, dinner, and plenty of snacks provided daily.