Eval Engineer

Braintrust · New York, NY

About The Position

We’re hiring an Eval Engineer to design and run creative evaluations of new AI capabilities. Your job is to turn emerging AI ideas into measurable experiments and publish the results for the developer ecosystem. When new models, agents, or frameworks appear, everyone has opinions about what works, but few people actually test them. This role exists to change that.

You’ll design experiments that compare models, prompts, and agent architectures against real tasks. You’ll build the datasets, scoring logic, and evaluation harnesses. Then you’ll publish the results so builders understand what actually works. This role sits at the intersection of engineering, experimentation, and technical storytelling.

Requirements

  • Built or contributed to evaluation systems for LLM or agent applications
  • Designed experiments comparing models, prompts, or AI architectures
  • Written Python code to run tests across models or APIs
  • Built datasets or scoring logic for AI quality measurement
  • Investigated model failures or unexpected behaviors
  • Published technical blog posts, research notes, or engineering write-ups
  • Built prototypes quickly to test ideas

Responsibilities

  • Design and run evaluations of new AI capabilities
  • Compare frontier models, agent systems, and tool workflows
  • Turn emerging ideas into measurable benchmarks
  • Define datasets, tasks, and scoring logic for experiments
  • Design realistic workloads that reflect production environments
  • Create tests that expose failure modes and edge cases
  • Build evaluation harnesses using Braintrust (see the sketch after this list)
  • Run comparisons across models, prompts, and agent approaches
  • Analyze traces, outputs, and failure patterns
  • Invent novel ways to stress-test AI systems
  • Design scenarios that break agents, prompts, and model reasoning
  • Build adversarial or complex datasets that reveal weaknesses
  • Write technical posts explaining evaluation methodology and results
  • Share datasets and scoring logic so experiments are reproducible
  • Help establish better evaluation patterns for the industry via courses
  • Develop reusable eval patterns for agents, RAG systems, and LLM apps
  • Create open source reference implementations developers can adopt
  • Contribute examples and guides that help teams build better evals
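
As a concrete illustration of the harness bullet above, here is a minimal sketch of what an evaluation built on Braintrust’s Python SDK might look like. The Eval entry point and the autoevals Levenshtein scorer follow the shape of Braintrust’s public quickstart; the project name, dataset rows, and task are hypothetical placeholders, not part of this posting.

    # pip install braintrust autoevals
    from braintrust import Eval
    from autoevals import Levenshtein

    Eval(
        "greeting-bot-eval",  # hypothetical project name
        # Dataset: each row pairs an input with an expected output.
        data=lambda: [
            {"input": "Foo", "expected": "Hi Foo"},
            {"input": "Bar", "expected": "Hi Bar"},
        ],
        # Task under test: a trivial stand-in for a model or agent call.
        task=lambda input: "Hi " + input,
        # Scoring logic: string similarity between output and expected.
        scores=[Levenshtein],
    )

A real harness would swap the toy task for a model or agent call and the inline rows for a curated dataset, but the data/task/scores shape is the pattern the responsibilities above describe.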

Benefits

  • Medical, dental, and vision insurance
  • Daily lunch, snacks, and beverages
  • Flexible time off
  • Competitive salary and equity
  • AI stipend

What This Job Offers

  • Job type: Full-time
  • Career level: Mid Level
  • Education level: None listed
  • Number of employees: 101-250
