Research Engineer - Evals

AGI, Inc.•San Francisco, CA

37d•Onsite

About The Position

We are seeking a Research Engineer specializing in Evals to join our mission of building everyday AGI. This role is crucial for ensuring that our models, agents, and product features demonstrably improve. You will build the evaluation harness for AGI, covering model capability, agentic behavior, on-device performance, and end-user experience. Your work will define what counts as a 'shipped' product and hold that standard under deadline pressure. You will own the eval suites that gate all model and agent releases, including capability, behavior, regressions, and human-rated rubrics. You will also develop the dashboards and tooling that support researcher experiment loops and leadership decision-making.

Requirements

Experience building eval harnesses for AI systems.
Ability to measure non-deterministic systems, including agent behavior, tool use, long-horizon tasks, and multilingual behavior.
Ability to push back when metrics are being gamed, without stalling the team's progress.
Experience working across AI research, product engineering, and partnerships.
Ability to translate technical performance into language that stakeholders can understand.
A link to an eval, benchmark, or measurement system you built, with a paragraph explaining a decision it changed.

Nice To Haves

Understanding of on-device performance trade-offs and their impact on real-user evaluations.
Experience with QA for AI at OEM scale.
Familiarity with the realities of shipping consumer agents to production partners.

Responsibilities

Build the eval suites that gate every model and agent release, including capability, behavior, regressions, and human-rated rubrics.
Develop the dashboards and tooling that make researcher experiment loops fast and leadership decisions easy.
Define and maintain the standard for what counts as ready to ship.