About The Position

Mindrift connects specialists with project-based AI work for leading tech companies, focused on testing, evaluating, and improving AI systems. Participation is project-based, not permanent employment. This opportunity involves building a dataset for evaluating AI coding agents by creating challenging tasks and evaluation criteria within realistic simulated environments. You will build virtual companies, assemble and calibrate tasks, design task environments, write tests, iterate with AI agents, review code, and analyze agent performance. This role is not data labeling, prompt engineering, or writing code from scratch; it centers on guiding and evaluating AI-generated code.

Requirements

  • Degree in Computer Science, Software Engineering, or a related field
  • 5+ years in software development, primarily Python (FastAPI, pytest, async/await, subprocess, file operations)
  • Background in full-stack development, with experience building React-based interfaces (JavaScript/TypeScript) and robust back-end systems
  • Experience writing tests (functional, integration — not just running them)
  • Experience with Docker containers
  • Familiarity with infrastructure tools (Postgres, Kafka, Redis)
  • Understanding of CI/CD (GitHub Actions as a user: triggers, labels, reading results)
  • English proficiency: B2

Responsibilities

  • Build virtual companies following a high-level plan, including a codebase, infrastructure, and context (conversations, documentation, tickets), to create a realistic environment with development history.
  • Assemble and calibrate tasks from intermediate states of the virtual company by crafting prompts, defining evaluation criteria, and ensuring tasks are solvable with fair evaluation.
  • Design tasks set in isolated environments that emulate a developer's workstation: a Linux machine, development tools (terminal, CLI), MCP servers (repository, task tracker, messenger, documentation, etc.), and a real web application codebase (see the first sketch after this list).
  • Write tests that accept all correct solutions and reject incorrect ones, neither too strict nor too lenient (see the test sketch after this list).
  • Iterate with an AI agent on tests, verifying they catch real problems, don't miss bad solutions, and don't break on good ones.
  • Review code written by agents, analyze why an agent failed or succeeded, and design edge cases and adversarial scenarios.
  • Iterate based on feedback from expert QA reviewers who score your work on quality criteria.
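
For illustration, a minimal Python sketch of how an isolated task environment might be spun up. Docker is assumed as the isolation layer (the requirements list Docker experience); the image name and workspace path are hypothetical placeholders, and the real provisioning tooling is project-specific.

    import subprocess

    def launch_task_environment(image: str, workspace: str) -> str:
        """Start a disposable container emulating a developer workstation.

        `image` and `workspace` are hypothetical placeholders; MCP servers,
        task tracker, and the like would be provisioned by project tooling.
        """
        result = subprocess.run(
            [
                "docker", "run",
                "--detach",                             # run in the background
                "--network", "none",                    # cut off outside network access
                "--volume", f"{workspace}:/workspace",  # mount the task codebase
                image,
                "sleep", "infinity",                    # keep the container alive
            ],
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout.strip()  # docker prints the new container ID

    # Hypothetical usage:
    container_id = launch_task_environment("sandbox-dev:latest", "/tmp/virtual-company")
    print(f"environment ready: {container_id}")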
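
And a sketch of the behavior-focused testing described above, using pytest. The `slugify` task, the `solution` module, and the expected values are all hypothetical; the point is that the tests assert observable behavior and edge cases, so any correct implementation passes while shortcut solutions fail.

    import pytest

    # `slugify` is the hypothetical task target; the agent's solution is
    # imported rather than reimplemented, so any correct implementation passes.
    from solution import slugify

    @pytest.mark.parametrize(
        ("title", "expected"),
        [
            ("Hello World", "hello-world"),               # happy path
            ("  Already--slugged  ", "already-slugged"),  # collapse separators
            ("C++ & Rust!", "c-rust"),                    # no dangling hyphens
            ("", ""),                                     # edge case: empty input
        ],
    )
    def test_slugify_behavior(title: str, expected: str) -> None:
        # Assert observable behavior only; implementation details stay free.
        assert slugify(title) == expected

    def test_slugify_is_idempotent() -> None:
        # Property-style check: rejects solutions that mangle already-clean input.
        once = slugify("Mixed CASE   input")
        assert slugify(once) == once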

Benefits

  • Project-based opportunities
  • Up to the equivalent of $45 per hour in compensation