Freelance Agent Evaluation Engineer

Mindrift

Posted 9 days ago • Remote

About The Position

Mindrift connects specialists with project-based AI opportunities at leading tech companies, with a focus on testing, evaluating, and improving AI systems. In this role, you will create challenging tasks and evaluation criteria within realistic simulated environments to assess how well AI coding agents handle real-world developer tasks. The work is project-based, not permanent employment.

Requirements

5+ years of software development experience
Core stack: Python (FastAPI), JavaScript/TypeScript (React), Docker, Postgres, Kafka, Redis
Experience writing tests (functional, integration)
English proficiency: B2 or higher

Responsibilities

Build realistic developer environments, including virtual companies with codebases, infrastructure, and context (tickets, docs, conversations) that form a believable development history.
Design tasks from intermediate states of these environments: craft prompts, define 'solved' criteria, and ensure each task is solvable by an AI agent.
Write tests that verify agent solutions, accepting all valid approaches and rejecting incorrect ones.
Iterate on tasks and tests based on QA feedback: review agent solutions, analyze failures, and refine until the evaluation is fair and robust.