Research, Pre-Training Data

Thinking Machines Lab•San Francisco, CA

Onsite

About The Position

Thinking Machines Lab is seeking Pre-Training Researchers to join their team. This role is central to the company's roadmap, blending research with large-scale data engineering to build the datasets and data systems for the next generation of AI models. The ideal candidate will design and implement methods for sourcing, curating, and analyzing pre-training data for quality and performance, working with both automated pipelines and human-in-the-loop processes. This position requires strong coding skills and the ability to contribute scientific insight. It is suited for individuals who enjoy the intersection of data, machine learning, and systems, and are excited by the challenge of shaping frontier AI. The role emphasizes both fundamental research and practical engineering, requiring the ability to write high-performance code and analyze technical reports. This is an evergreen role, meaning applications are continuously reviewed for current and future opportunities.

Requirements

Proficiency in Python and familiarity with at least one deep learning framework (e.g., PyTorch, TensorFlow, or JAX).
Comfort with debugging distributed training and writing code that scales.
Bachelor’s degree or equivalent experience in Computer Science, Machine Learning, Physics, Mathematics, or a related discipline with strong theoretical and empirical grounding.
Clarity in communication and an ability to explain complex technical concepts in writing.

Nice To Haves

A strong grasp of probability, statistics, and ML fundamentals. Ability to look at experimental data and distinguish between real effects, noise, and bugs.
Experience with curation, preprocessing, and analysis of large-scale text, code, or multimodal datasets.
Prior experience in data engineering, dataset construction, or large-scale web data processing for machine learning models.
Experience evaluating or improving training data quality and knowledge of data ethics, safety, and licensing frameworks relevant to AI dataset creation.
Contributions to open datasets, research publications, or data tooling.
PhD in Computer Science, Machine Learning, Physics, Mathematics, or a related discipline with strong theoretical and empirical grounding, or equivalent industry research experience.

Responsibilities

Design and implement techniques for curating, sourcing, and filtering large-scale text, code, and multimodal data.
Develop data quality metrics and analysis to measure coverage, diversity, and representativeness across sources.
Collaborate with research and infrastructure teams to scale data processing systems efficiently and reproducibly.
Investigate and mitigate data risks, including privacy, safety, and licensing concerns, to ensure responsible and ethical data use.
Continuously evaluate dataset improvements by analyzing their downstream effects on model learning and behavior.
Publish and present research that moves the entire community forward. Share code, datasets, and insights that accelerate progress across industry and academia.