Evaluations Engineer

Vals AI
San Francisco, CA
Onsite

About The Position

We are looking for strong engineers to join our team and own the leaderboards that appear on Vals AI. You will be responsible for testing and benchmarking new models as they are released, on tasks in law, tax, coding, finance, and more. You will analyze the error modes of models, evaluate their strengths and weaknesses, and work with our communications team to release results.

Our results are used by startups, enterprises, and research labs alike. We work with all the major foundation model labs, as well as some of the largest financial institutions and hospital systems in the world. Our work has been featured by the Wall Street Journal, Washington Post, and Bloomberg. We are building the standard for evaluating the ability of LLMs to perform real-world tasks, and you will contribute directly to the leaderboards that make this possible.

Requirements

  • Familiarity with LLMs: You should already be familiar with the space: the current leading models, their relative performance, and how to use large language models in practice.
  • Strong engineering fundamentals: You can build and ship quickly with high quality. You should have a track record of building things of significant scope (at jobs, side projects, open source, etc.).
  • Python expertise: Significant experience in Python, especially in a professional setting.
  • Team collaboration: Experience working in development sprints, Git workflows, and pull request reviews.
  • Location: We are an in-person team based in San Francisco. We will support your relocation or transportation as needed.

Nice To Haves

  • Previous experience with benchmarking large language models, or creating benchmarks
  • Previous experience working at a startup or starting your own company
  • Technical writing experience and ability
  • Machine learning research experience

Responsibilities

  • Evaluate new LLM releases across the Vals AI suite of benchmarks
  • Work directly with both open-source and closed-source foundation model labs in evaluating model performance
  • Use tools like Docent to analyze common failure modes and patterns in model performance
  • Work directly with our social media team to post interesting findings and results
  • Add new models and maintain integrations in our model library
  • Help improve and maintain the infrastructure we use to run benchmarks (agentic and non-agentic)
  • Collaborate closely with our research team on the creation of new benchmarks

Benefits

  • Highly competitive salary and meaningful ownership. Excellence is well rewarded.
  • Relocation and transportation support
  • Health/dental insurance coverage
  • Lunch and dinner provided, free snacks/coffee/drinks
  • 401(k) plan
  • Unlimited PTO