Eval Engineer

Braintrust · New York, NY

About The Position

We’re hiring an Eval Engineer to design and run creative evaluations of new AI capabilities. Your job is to turn emerging AI ideas into measurable experiments and publish the results for the developer ecosystem. When new models, agents, or frameworks appear, everyone has opinions about what works, but few people actually test them. This role exists to change that.

You’ll design experiments that compare models, prompts, and agent architectures against real tasks. You’ll build the datasets, scoring logic, and evaluation harnesses. Then you’ll publish the results so builders understand what actually works. This role sits at the intersection of engineering, experimentation, and technical storytelling.

Requirements

  • Built or contributed to evaluation systems for LLM or agent applications
  • Designed experiments comparing models, prompts, or AI architectures
  • Written Python code to run tests across models or APIs
  • Built datasets or scoring logic for AI quality measurement
  • Investigated model failures or unexpected behaviors
  • Published technical blog posts, research notes, or engineering write-ups
  • Built prototypes quickly to test ideas

Responsibilities

  • Design and run evaluations of new AI capabilities
  • Compare frontier models, agent systems, and tool workflows
  • Turn emerging ideas into measurable benchmarks
  • Define datasets, tasks, and scoring logic for experiments
  • Design realistic workloads that reflect production environments
  • Create tests that expose failure modes and edge cases
  • Build evaluation harnesses using Braintrust (see the sketch after this list)
  • Run comparisons across models, prompts, and agent approaches
  • Analyze traces, outputs, and failure patterns
  • Invent novel ways to stress-test AI systems
  • Design scenarios that break agents, prompts, and model reasoning
  • Build adversarial or complex datasets that reveal weaknesses
  • Write technical posts explaining evaluation methodology and results
  • Share datasets and scoring logic so experiments are reproducible
  • Help establish better evaluation patterns for the industry via courses
  • Develop reusable eval patterns for agents, RAG systems, and LLM apps
  • Create open source reference implementations developers can adopt
  • Contribute examples and guides that help teams build better evals
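
As a concrete illustration of the harness bullet above, here is a minimal sketch of what an evaluation built on Braintrust’s Python SDK might look like. The Eval entry point and the autoevals Levenshtein scorer follow the shape of Braintrust’s public quickstart; the project name, dataset rows, and task are hypothetical placeholders, not part of this posting.

    # pip install braintrust autoevals
    from braintrust import Eval
    from autoevals import Levenshtein

    Eval(
        "greeting-bot-eval",  # hypothetical project name
        # Dataset: each row pairs an input with an expected output.
        data=lambda: [
            {"input": "Foo", "expected": "Hi Foo"},
            {"input": "Bar", "expected": "Hi Bar"},
        ],
        # Task under test: a trivial stand-in for a model or agent call.
        task=lambda input: "Hi " + input,
        # Scoring logic: string similarity between output and expected.
        scores=[Levenshtein],
    )

A real harness would swap the toy task for a model or agent call and the inline rows for a curated dataset, but the data/task/scores shape is the pattern the responsibilities above describe.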

Benefits

  • Medical, dental, and vision insurance
  • Daily lunch, snacks, and beverages
  • Flexible time off
  • Competitive salary and equity
  • AI stipend

What This Job Offers

  • Job type: Full-time
  • Career level: Mid Level
  • Education level: None listed
  • Number of employees: 101-250
