Head of Research

Vals AI•San Francisco, CA

Onsite

About The Position

Measuring intelligence is hard, and humans haven't been particularly good at it. The proxies we've used (IQ, standardized tests, credentials) have shaped how we develop intelligence and how we value it, often in ways we later regret. AI gives us a chance to do better. The field is young enough that the methodologies for measuring what these systems can actually do are still being written, and the answers we settle on will shape what gets built, what gets deployed, and which workflows get automated next.

Vals is building the measurement layer for the AI economy: the benchmarks, methodologies, and standards that determine which models ship and where they get trusted. We're hiring a Head of Research to lead it.

The hard research questions don't have textbook answers yet. How do you measure whether an LLM can actually do a real lawyer's contract review, a real underwriter's risk assessment, a real radiologist's read? How do you build evaluations that hold up as models get better at gaming them? You'll be the person setting the direction on how Vals, and by extension much of the field, answers them.

Requirements

A PhD in ML/NLP (in progress or completed), or an equivalent industry research track record.
Deep familiarity with the LLM evaluation landscape: existing benchmarks, their failure modes, judge-model approaches, human-in-the-loop methodologies.
A bias toward research that affects what people actually deploy, rather than benchmarks that are easy to game.
Strong written and verbal communication. You'll publish, present, and talk to customers and labs.
Ability to work in person in San Francisco.

Nice To Haves

A widely cited benchmark or eval framework you've built or co-built.
Prior experience at a frontier lab (Anthropic, OpenAI, Google DeepMind, Meta FAIR) or a research-led startup.
Domain depth in one or more of our verticals (legal, finance, insurance, healthcare).
Experience leading or mentoring other researchers.
A public research presence: papers, blog posts, talks, or open-source contributions people in the field recognize.

Responsibilities

Advance the science of evaluation. The methodologies the field uses today — judge models, human-in-the-loop, static benchmarks — were built for a previous generation of models and break down on long-horizon, real-world tasks. You'll develop the new paradigms.
Oversee Vals' broader research portfolio, setting direction across the projects already underway and the ones we haven't started yet.
Publish work that moves the field forward. We want Vals' research to be cited, not just shipped.
Recruit and grow a research team alongside the founders.
Work directly with our enterprise customers and lab partners on the evaluation problems they actually have.