Evaluations Engineer

Vals AI•San Francisco, CA

Onsite

About The Position

We are looking for strong engineers to join our team and own the leaderboards that appear on Vals AI. You will be responsible for testing and benchmarking new models as they are released, on tasks in law, tax, coding, finance, and more. You will analyze the failure modes of models, evaluate their strengths and weaknesses, and work with our communications team to release results. Our results are used by startups, enterprises, and research labs alike. We work with all the major foundation model labs, as well as some of the largest financial institutions and hospital systems in the world. Our work has been featured by the Wall Street Journal, Washington Post, and Bloomberg. We are building the standard for evaluating the ability of LLMs to perform real-world tasks. You will contribute directly to the leaderboards that make this possible.

Requirements

Familiarity with LLMs: You should already be familiar with the space: the current leading models, their relative performance, and how to use large language models in practice.
Strong engineering fundamentals: You can build and ship quickly with high quality. You should have a track record of building things of significant scope (at jobs, side projects, open source, etc.).
Python expertise: Significant experience in Python, especially in a professional setting.
Team collaboration: Experience working in development sprints, Git workflows, and pull request reviews.
Location: We are an in-person team based in San Francisco. We will support your relocation or transportation as needed.

Nice To Haves

Previous experience with benchmarking large language models, or creating benchmarks
Previous experience working at a startup or starting your own company
Technical writing experience and ability
Machine learning research experience

Responsibilities

Evaluate new LLM releases across the Vals AI suite of benchmarks
Work directly with both open-source and closed-source foundation model labs in evaluating model performance
Use tools like Docent to analyze common failure modes and patterns in model performance
Work directly with our social media team to post interesting findings and results
Add new models and maintain integrations in our model library
Help improve and maintain the infrastructure we use to run benchmarks (agentic and non-agentic)
Collaborate closely with our research team on the creation of new benchmarks

Benefits

Highly competitive salary and meaningful ownership. Excellence is well rewarded.
Relocation and transportation support
Health/dental insurance coverage
Lunch and dinner provided, free snacks/coffee/drinks
401(k) plan
Unlimited PTO
