LLM Evals Engineering Lead

Grafton Sciences · San Francisco, CA

About The Position

We’re seeking an LLM Evals Engineering Lead to build the evaluation and verification layer for agentic LLM systems that act in complex environments and drive autonomous workflows. You’ll design eval suites, automated verifiers, and regression gates that measure real progress on long-horizon planning, agent execution, uncertainty retirement, and end-to-end build success. The role spans systems engineering, rigorous experimentation, and tight collaboration with LLM scientists, agent/toolchain engineers, and simulation teams.

Requirements

  • Strong experience building evaluation systems for ML models (LLMs preferred) with high engineering rigor.
  • Excellent software engineering skills (Python, data pipelines, test harnesses, distributed execution, reproducibility).
  • Deep understanding of agentic failure modes (tool misuse, hallucinated evidence, reward hacking, brittle formatting) and how to measure them.
  • Ability to work across research and production systems in a fast-moving environment.

Responsibilities

  • Build an eval harness for agentic LLM systems, covering offline, simulator-in-the-loop, and workflow-in-the-loop evaluation (a minimal harness sketch appears after this list).
  • Design evals for long-horizon planning, correctness of individual agent calls, recovery behavior, and safety/constraint adherence.
  • Help with verifier-driven scoring (symbolic checks, simulation/twin checks, surrogate checks) and automated self-correction of the execution pipeline (see the verifier-composition sketch after this list).
  • Create regression gates and release criteria for model/prompt/toolchain changes to prevent capability and safety regressions (an illustrative gate check follows this list).
  • Define metrics for outlier identification and for efficient question-asking that reduces uncertainty per unit time.
  • Partner with training teams to turn eval failures into data (SFT/DPO/RL signals) and continuously improve the suite.
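
For illustration only: a minimal sketch of what an offline eval harness might look like, in Python. All names here (Task, EvalReport, run_suite, and the agent callable) are hypothetical, not this team's actual API; simulator-in-the-loop and workflow-in-the-loop variants would swap the agent callable for one wired to a live simulator or workflow.

    # Hypothetical sketch of an offline eval harness for agentic runs;
    # every name below is illustrative, not an API from the posting.
    from dataclasses import dataclass, field
    from typing import Callable

    @dataclass
    class Task:
        task_id: str
        prompt: str
        verifier: Callable[[str], bool]  # True if the transcript passes

    @dataclass
    class EvalReport:
        passed: int = 0
        failed: int = 0
        failures: list = field(default_factory=list)

    def run_suite(tasks: list[Task], agent: Callable[[str], str]) -> EvalReport:
        """Run each task offline and score its transcript with the task's verifier."""
        report = EvalReport()
        for task in tasks:
            transcript = agent(task.prompt)  # offline: no live tools or simulator
            if task.verifier(transcript):
                report.passed += 1
            else:
                report.failed += 1
                report.failures.append(task.task_id)
        return report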
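Likewise hedged: one way verifier-driven scoring could be composed, with stub functions standing in for the symbolic and simulation/twin checks the bullet names (a surrogate-model check would slot in the same way).

    # Hypothetical verifier composition: each check votes on a structured
    # run record; the check bodies below are stubs, not real integrations.
    from typing import Callable

    Verifier = Callable[[dict], bool]

    def symbolic_check(run: dict) -> bool:
        # e.g., parse the final answer and validate its type/format
        return isinstance(run.get("final_answer"), str)

    def simulation_check(run: dict) -> bool:
        # e.g., replay the proposed actions in a digital twin (stubbed here)
        return bool(run.get("sim_success"))

    def verified_score(run: dict, verifiers: list[Verifier]) -> float:
        """Fraction of verifiers that pass; 1.0 means fully verified."""
        results = [v(run) for v in verifiers]
        return sum(results) / len(results)

    # Usage: verified_score(run, [symbolic_check, simulation_check])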
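And a sketch of a regression gate, assuming per-metric baselines and tolerances the team would choose; the metric names are made up for the example.

    # Hypothetical regression gate: block a model/prompt/toolchain change
    # if any tracked metric drops below baseline by more than its tolerance.
    def gate(baseline: dict, candidate: dict, tolerances: dict) -> bool:
        """Return True if the candidate change may ship."""
        ok = True
        for metric, tol in tolerances.items():
            if candidate[metric] < baseline[metric] - tol:
                print(f"REGRESSION: {metric} {baseline[metric]:.3f} -> "
                      f"{candidate[metric]:.3f} (tolerance {tol})")
                ok = False
        return ok

    # Example: safety metrics get zero tolerance, capability a small one.
    passed = gate(
        baseline={"task_success": 0.81, "constraint_adherence": 0.99},
        candidate={"task_success": 0.80, "constraint_adherence": 0.99},
        tolerances={"task_success": 0.02, "constraint_adherence": 0.0},
    )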

Benefits

  • We offer a competitive salary, meaningful equity, and benefits.