Research Engineer – Evals

Firecrawl • San Francisco, CA

Hybrid

About The Position

This role focuses on building the evaluation systems for Firecrawl, a company that reliably converts URLs into clean, structured, LLM-ready data. The challenge lies in rigorously measuring this core promise across diverse websites, formats, and edge cases, especially as models and agent workflows are integrated. The position involves designing metrics, building pipelines, generating datasets, and managing the feedback loop from output quality to model and product decisions. It's an opportunity for engineers with strong engineering depth who are passionate about defining and measuring what 'good' performance means.

Requirements

3+ years in ML engineering, applied AI, or data quality with production systems experience.
Experience building your own eval infrastructure, including pipelines, datasets, rubrics, and validation of judges.
Experience running evals at scale and debugging them.
Experience with unstructured web data, understanding challenges in markdown quality and structured extraction fidelity.
Strong opinions on useful benchmarks and the rigor to validate them.
Fluent in LLM evaluation methodology, including LLM-as-judge systems, correlation with human judgment, and failure modes.
Experience designing rubrics, building scalable human review pipelines, and measuring inter-rater agreement.
Production-minded: you care whether evals reflect real production behavior and can make tradeoffs between evaluation depth, coverage, and cost.
Ability to iterate quickly and communicate findings clearly.
Backgrounds in ML engineering with eval or data quality systems at AI labs or applied teams.
Experience with LLM fine-tuning or RLHF pipelines.
Experience at the intersection of data infrastructure and model development.

Nice To Haves

Any other niche expertise and skills
Previous experience at a scraping, automation, or security-focused startup
Ex-founder

Responsibilities

Build the eval stack from scratch, designing and owning systems to measure the quality of Firecrawl's outputs across scrape, crawl, extract, and map.
Define metrics, build pipelines, curate datasets, and integrate evals into CI/CD to catch regressions.
Design benchmarks that reflect real-world scenarios across millions of websites, including SPAs, paywalled content, dynamic rendering, and various data formats.
Design collection and labeling systems for benchmark datasets.
Own LLM-as-judge pipelines: design and validate automated judges, understand LLM evaluation failure modes, and build human review tooling.
Turn quality measurements into reward signals and feedback loops for RL and Search/IR research engineers to improve models.
Design and run fast experiments to test meaningful hypotheses and communicate findings clearly.
Collaborate with RL and Search/IR research engineers to integrate evaluation feedback into model training.

Benefits

Technical ownership
Real impact
High velocity
Small team, big ambition
Salary that makes sense — $140,000-180,000/year (U.S.-based)
Own a piece — Up to 0.15% equity
Unlimited PTO — Minimum 3 weeks off encouraged
Parental leave — 12 weeks fully paid
Wellness stipend — $100/month
Learning & Development - Expense up to $150/year
Team offsites
Sabbatical — 3 paid months off after 4 years
Medical, dental, and vision (100% for employees, 50% for spouse/kids) (US-based)
Life & Disability insurance — Employer-paid short-term disability, long-term disability, and life insurance (US-based)
Supplemental options — Optional accident, critical illness, hospital indemnity, and voluntary life insurance (US-based)
Doctegrity telehealth (US-based)
401(k) plan (US-based)
Pre-tax benefits — Access to FSAs and commuter benefits (US-based)
Pet insurance (US-based)
SF HQ perks — Snacks, drinks, team lunches, and the occasional burst of chaotic startup energy (SF-based)