Member of Technical Staff - Evals

P-1 AI
United States, Hybrid

About The Position

In this role, you’ll be responsible for the evals we use to ensure that Archie is learning and retaining the skills needed to perform its engineering work successfully, and to benchmark it against industry skill expectations. Working within a small, tightly knit team of high performers, you’ll be principally responsible for clearly defining, implementing, and validating these evals, incorporating input from our engineering experts and industrial partners. You’ll also be responsible for translating these evals into multiple formats for use with different types of AI and non-AI systems and agents.

This role is remote, and you can be based anywhere in the US or Canada, where you must have existing work authorization. You will be expected to travel to our San Mateo office for co-working sessions approximately one week out of every six. If you are already located in the Bay Area or are interested in relocating, you are of course welcome to work out of our San Mateo office; our AI team is based there, so there is some benefit to being in-office at least part of the time.

Requirements

  • Experience in constructing comprehensive test suites for software and/or AI systems, including coordinating the contributions of others.
  • Experience designing metrics to evaluate systems and visualize their performance, including differences across successive generations.
  • Good communication skills with a variety of stakeholders (AI researchers, domain experts, application developers).
  • Proficiency in Python, including complex modules, and familiarity with modern software development tools and practices (Git, CI/CD, etc.).
  • Ability to thrive in a fast-paced, dynamic startup environment.

Nice To Haves

  • Experience developing, managing, and running evals against LLM-based systems.

Responsibilities

  • Implement and operate the system for organizing, transforming, running, grading, and reporting on eval benchmarks.
  • Design and execute the process by which we develop and QA our evals, incorporating contributions from our own engineering team, industrial partners, and subject-matter experts.
  • Ensure that evals run effectively within our CI/CD system, continuously benchmarking our evolving AI platform and the experiments we’re performing around it.
  • Create methods for detecting and testing for common quality challenges of AI, including hallucinations, undesirable stochasticity, and regressions.
  • Be a technical leader in the consistent implementation and organization of automated tests across other areas of our technology stacks.

Benefits

  • Healthcare, dental, and vision insurance
  • 401(k) with employer matching
  • Unlimited PTO
  • Significant equity component