Research Engineer, Model Evaluations

Anthropic
San Francisco, CA
Hybrid

About The Position

Anthropic is seeking Research Engineers to build and implement evaluations for its AI systems, with a specific focus on Claude. The role involves turning abstract concepts of intelligence into measurable metrics, designing and executing evaluations across Claude's capabilities and personality, and developing the infrastructure to run these evaluations at scale. The goal is to establish Anthropic as a leader in well-characterized AI systems, with performance that is exhaustively measured and validated. The position requires close collaboration with researchers throughout the lifecycle of new capabilities, from defining measurement criteria to interpreting results.

Requirements

  • Strong Python programming skills, including experience with production or research infrastructure.
  • Experience building or operating distributed systems, data pipelines, or other scalable, reliable infrastructure.
  • Clear written and verbal communication skills, particularly in explaining technical results to non-specialists.
  • Comfort operating in an on-call or production-support capacity during live training runs.
  • Genuine concern for the societal impacts of AI and an interest in steering powerful AI toward safe and beneficial outcomes.

Nice To Haves

  • Hands-on experience using large language models like Claude, including prompting, sampling, and scaffolding.
  • Background in data visualization and a track record of building trusted dashboards.
  • Experience developing robust evaluation metrics for language models.
  • Experience with observability, monitoring, or experiment-tracking systems.
  • Background in statistics and experimental design.
  • Experience with large-scale dataset sourcing, curation, and processing.
  • Experience running or supporting ML training infrastructure.
  • A bias toward picking up slack and operating flexibly across team boundaries.
  • Enjoyment of pair programming.

Responsibilities

  • Design and run new evaluations of Claude's capabilities, including reasoning, agentic behavior, knowledge, and safety properties, and produce visualizations of the results.
  • Build and maintain a distributed eval execution platform for reliable, large-scale evaluation runs against training checkpoints.
  • Own and improve dashboards for monitoring model health during training, improving the signal-to-noise ratio, reducing latency, and preventing regressions.
  • Debug anomalous eval results during training runs, identify the root cause (model change or infrastructure issue), and communicate findings clearly under pressure.
  • Enhance tooling, libraries, and workflows for researchers implementing and iterating on evaluations.
  • Partner with research teams across the entire lifecycle of new capabilities, from defining measurement goals to interpreting results.
  • Conduct experiments to characterize the impact of prompting, sampling, and scaffolding choices on internal and industry benchmarks.
  • Communicate evaluation findings to internal stakeholders and external audiences as appropriate.

Benefits

  • Competitive compensation
  • Optional equity donation matching
  • Generous vacation
  • Parental leave
  • Flexible working hours
  • Lovely office space