Research Scientist, Benchmarks & Evaluations

Protege

About The Position

DataLab is Protege’s research arm, focused on tackling fundamental challenges in data for AI. This role leads the design of benchmarks and evaluations for AI models, ensuring they are trustworthy and relevant for frontier labs, enterprises, and policymakers. The Research Scientist will be responsible for the scientific rigor of evaluation across DataLab: designing tasks that effectively differentiate model capabilities, validating these tasks against human baselines, and identifying issues like contamination and bias. The work involves publishing research to establish Protege as a standard-setter and translating findings into evaluation datasets for customers. A significant part of the role is managing outsourced annotation vendors and developing statistical methods to ensure data quality and trustworthiness.

Protege is building a platform to address the critical need for secure, efficient, and privacy-centric AI training data exchange. The company is backed by top investors and is positioned to be a major player in the AI and tech industries. The culture emphasizes velocity, impact, scientific rigor, and ownership, attracting individuals who thrive in ambiguous environments and want to shape the future of data and AI.

Requirements

Advanced degree (PhD preferred, or MS/BS plus equivalent industry experience) in a quantitative field such as applied econometrics with AI experience, quantitative finance, computer science, engineering, statistics/mathematics, or any applied research discipline.
Hands-on experience evaluating LLMs, agents, or other ML systems, including prompting, scaffolding, and familiarity with tools for large-scale evals.
Experience with annotator quality and inter-rater reliability, including designing labeling protocols, computing agreement statistics, and analyzing annotator bias and calibration.
Excellent scientific writing and communication skills, capable of synthesizing technical findings for diverse audiences (frontier labs, enterprise customers, policymakers).
A bias toward velocity and the ability to determine which pipelines should be production-grade versus scrappy to achieve reliable results quickly.

Nice To Haves

Experience with RL evaluation techniques, such as reward modeling, off-policy evaluation, and evals for RLHF/RLAIF or agentic RL pipelines.
Ability to quickly navigate new customer architectures, data systems, and requirements.
Experience with latent-variable models of annotator skill (e.g., Dawid-Skene, MACE, IRT-style approaches) or running large expert-annotator panels in regulated domains.
A track record of published benchmarks or evaluation papers that have been adopted by the field.

Responsibilities

Design tasks and benchmarks that distinguish capability levels across frontier models, including agentic, reasoning-heavy, and domain-specific settings (healthcare, finance, scientific).
Validate evaluations rigorously by running human baselines, analyzing inter-rater reliability, studying the impact of elicitation and scaffolding, and quantifying signal versus noise.
Develop the "science of evals" at Protege, incorporating item response theory, contamination analysis, predictive validity studies, and statistical frameworks for model comparison with uncertainty.
Run evaluations on current frontier models, potentially collaborating with partners at AI labs, enterprises, and government.
Publish research to establish Protege as a standard-setter for evaluation data and contribute to the broader AI community's understanding of effective evaluations.
Translate research findings into product, working with data and engineering teams to create evaluation datasets for customers.
Partner with outsourced annotation vendors, owning the statistical machinery that determines annotator trustworthiness and task suitability and generates reliability scores for customers.
