About The Position

Data is playing an increasingly crucial role at the frontier of AI innovation. Many of the most meaningful advances in recent years have come not from new architectures, but from better data. As a member of the Data Team, your mission is to build and operate the ingestion systems that turn the open web and other large-scale data sources into reliable, well-structured corpora for training frontier models. You will own the machinery that acquires, extracts, normalizes, versions, and delivers data to our pre-training pipelines, and you will work directly with world-class researchers to close the loop between what we collect and how it affects model performance.

This role is ideal for engineers who love building robust distributed systems, but who also want to run experiments, reason about tradeoffs in data acquisition, and iterate quickly based on measurable impact. You will work closely with our pre-training and data quality teams.

Requirements

  • Curious about how training data influences model capabilities, and able to iterate quickly based on measurable downstream impact
  • Able to collaborate tightly across functions: researchers, infra, operations, and external partners
  • Enjoy working in a hybrid research–engineering role
  • Experience building web crawling, data ingestion, or large-scale data acquisition systems using Ray, Beam, Spark, or similar technologies
  • Familiarity with how LLMs are trained and evaluated, and an intuition for what makes data useful for training
  • Comfortable working with very large datasets (multi-TB to PB scale) and building systems that are observable, testable, and maintainable
  • Comfortable designing experiments and using data to guide system improvements
  • Excellent communication skills: able to explain system behavior and to weigh and communicate tradeoffs clearly

Responsibilities

  • Build and operate large-scale data ingestion systems for pre-training, including web crawling, extraction, and dataset delivery
  • Run experiments to evaluate crawling strategies, extraction methods, and ingestion tradeoffs
  • Analyze ingested data to identify gaps, redundancy, and areas to improve
  • Build ingestion pipelines that scale reliably across large data campaigns
  • Develop specialized crawlers for high-priority data sources
  • Review code, debug production issues, and continuously improve ingestion infrastructure

Benefits

  • Top-tier compensation: Salary and equity structured to recognize and retain the best talent globally.
  • Health & wellness: Comprehensive medical, dental, vision, life, and disability insurance.
  • Life & family: Fully paid parental leave for all new parents, including adoptive and surrogate journeys. Financial support for family planning.
  • Benefits & balance: Paid time off when you need it, relocation support, and more perks that optimize your time.
  • Opportunities to connect with teammates: Lunch and dinner are provided daily. We have regular off-sites and team celebrations.

What This Job Offers

  • Job Type: Full-time
  • Career Level: Mid Level
  • Education Level: No Education Listed
  • Number of Employees: 51-100
