Research Engineer - Data

Periodic Labs
Menlo Park, CA
Hybrid

About The Position

Periodic Labs is an AI and physical sciences company focused on building state-of-the-art models to accelerate breakthroughs in materials, energy, and beyond. The Research Engineer - Data will be responsible for building and driving the data foundation for the company's research efforts. This includes owning the end-to-end data strategy, from sourcing and procuring external datasets to integrating internally generated experimental data into the training stack. The role ensures that researchers have access to the right data in the optimal format for training and improving frontier models. It sits at the intersection of data engineering, research infrastructure, and strategy, requiring close collaboration with researchers to understand data needs and build the necessary pipelines and systems. The work involves collecting and organizing diverse data sources, improving data quality through deduplication and preprocessing, and ensuring new experimental results are incorporated in a structured, repeatable manner for model development.

Requirements

  • Experience building large-scale data pipelines for LLM pretraining or midtraining, including web-scale or scientific corpora
  • Expertise in data quality techniques such as exact and fuzzy deduplication (MinHash, SimHash), perplexity filtering, classifier-based quality scoring, and PII scrubbing
  • Experience working with diverse scientific data formats — papers, patents, structured databases, simulation outputs, lab instrument exports — and normalizing them for model consumption
  • Experience with distributed data processing frameworks such as Apache Spark, Ray, or Dask at multi-terabyte to petabyte scale
  • Familiarity with dataset versioning, lineage tracking, and reproducibility tooling such as DVC, Delta Lake, or custom solutions
  • Experience sourcing and evaluating third-party datasets, including licensing considerations and quality assessment
  • Strong Python engineering skills and comfort building production-quality tooling in a research environment
  • Experience collaborating directly with ML researchers to translate data needs into pipeline requirements and back again
  • A research-oriented mindset — you run experiments on data, measure outcomes, and iterate with rigor

Nice To Haves

  • Experience curating scientific datasets specifically for domain-adaptive continued pretraining or instruction tuning
  • Familiarity with synthetic data generation methods, including model-generated data pipelines and quality verification
  • A background in a physical science or engineering discipline that informs how you think about scientific data quality and structure
  • Experience with multimodal data — integrating text, structured numerical data, molecular representations, or spectral data into unified training pipelines

Responsibilities

  • Own data strategy across the training stack — identifying gaps, evaluating new sources, and shaping the overall data roadmap in collaboration with research leads
  • Source, evaluate, and procure external datasets across scientific domains including chemistry, physics, materials science, mathematics, and lab instrumentation
  • Build and maintain robust pipelines for ingesting, processing, and versioning large-scale datasets from heterogeneous sources
  • Design and implement data quality systems including deduplication, domain classification, quality filtering, and format normalization at scale
  • Integrate internally generated experimental data — from lab instrumentation, simulations, and model outputs — into the training stack in a structured and repeatable way
  • Build tooling that makes it easy for researchers to inspect, query, and understand the data that goes into training runs
  • Instrument data pipelines with metadata, lineage tracking, and versioning so experiments are reproducible and data decisions are auditable
  • Collaborate with pretraining and midtraining engineers on token budget management, data mixing ratios, and curriculum design
  • Stay current with research on data-efficient training, synthetic data generation, and data selection methods — and bring relevant ideas into production

Benefits

  • Visa sponsorship