Research Crawling Engineer

Wynd Labs
Remote

About The Position

We build infrastructure that delivers massive amounts of web data to the companies training the world's most powerful AI models. We power and support Grass, a bandwidth-sharing network that lets us operate a massive distributed crawler, giving us unique access to high-quality public web data at global scale. On top of that, we've built pipelines for ingesting, segmenting, and annotating billions of videos, transcripts, and audio files, powering dataset creation for frontier labs. We're lean, technical, and move fast: no red tape, no slow decision-making, just a team of builders pushing to expand what's possible for open web data and AI.

As a Research Crawling Engineer, you will design and operate large-scale web data acquisition systems for research and model development. Your work will span distributed systems, scraping infrastructure, and data pipelines.

Requirements

  • Strong programming experience in one or more of: Go, Rust, Python, Java, or C++
  • Experience building web crawlers or large-scale data pipelines
  • Solid understanding of HTTP, networking, and browser behavior
  • Familiarity with distributed systems and parallel processing
  • Experience working with large datasets (TB–PB scale preferred)
  • Ability to debug unstable or adversarial environments

Nice To Haves

  • Experience with NLP pipelines or dataset curation for ML
  • Familiarity with LLM pretraining data or retrieval systems
  • Experience with headless browsers (e.g., Chrome DevTools Protocol, Playwright, Puppeteer)
  • Knowledge of proxy systems, IP rotation, and large-scale request orchestration
  • Background in data quality evaluation or benchmarking
  • Experience running workloads on cloud or bare-metal infrastructure

Responsibilities

  • Build and maintain large-scale web crawlers across diverse domains
  • Design high-throughput, fault-tolerant systems for data collection (millions to billions of URLs/day)
  • Handle anti-bot systems, rate limits, and dynamic/JS-heavy sites
  • Develop pipelines for cleaning, deduplication, filtering, and normalization
  • Construct and maintain datasets for research and model training
  • Monitor crawl performance, coverage, and data quality; iterate quickly
  • Collaborate with research teams to align data collection with modeling needs
  • Optimize infrastructure for cost, latency, and reliability

Benefits

  • Competitive salary
  • Benefits
  • Equity package