LATAM)

Anyone AI

Remote

About The Position

This role is responsible for Anyone AI’s data initiatives and proposals to AI labs, covering the entire process from originating a data proposal or responding to a request through pilot delivery. The individual will own the creation of proposals, sample packages, and benchmarks, including frontier-grade packages across reasoning, coding, agents and tool use, multimodal, and other domains. These will be produced in collaboration with subject-matter experts and will feature expert-verified ground truth, multi-model headroom results, and QC that meets buyer-side scrutiny. The role is central to demonstrating the company's quality and converting pilots into production engagements, and it serves as the operational core of the Human Data Division within a small team.

Requirements

Experience originating data or benchmark proposals for AI labs.
Experience translating eval targets into sample tasks that demonstrate capability.
Experience owning an engagement through delivery.
Deep evaluation and quality expertise, particularly in LLM benchmarking and code-model evaluation.
Experience building QC processes and artifact standards that meet enterprise or lab requirements.
Experience setting the quality bar for a team of experts.
Ability to thrive in ambiguous, fast-moving environments where the rules are still being written.
Ability to deliver under pressure.
5+ years in technical delivery, quality, or program management, with recent experience in AI/ML data, model evaluation, or benchmarking.
Hands-on experience delivering data or evaluation work to AI labs or enterprise ML teams, from scoping through delivery.
Working fluency with how frontier models are evaluated: benchmarks, rubrics, pass rates, headroom, and what makes a task discriminate between models.
Proven people/vendor leadership: experience recruiting, calibrating, and holding a team or expert pool to a quality standard.
Fluent English.

Nice To Haves

Spanish proficiency.

Responsibilities

Study public benchmarks and eval targets, and turn them into proposals and sample packages that demonstrate capability and win the work.
Respond to lab data requests and pilots.
Design and build the sample packages with subject-matter experts, ensuring each package meets the bar of the current sample set: expert-verified, exact-match-checkable ground truth and gold reasoning trajectories.
Develop multi-model evaluations that show real headroom and prove that each task discriminates between models, not just that it is hard.
Implement a rigorous QC structure: calibration layers, severity-weighted rubrics, deterministic verifiers, evidence maps, etc.
Recruit, brief, calibrate, and review a pool of experts across coding, agentic/tool-use, and STEM/reasoning.
Raise expert output to the company's standard and maintain it; act as the arbiter of what "correct" and "frontier-difficulty" mean.
Serve as a direct point of contact for lab partners on Slack and calls, with support from the CEO and the wider team.
Keep senior lab contacts informed, surface their actual needs, and pull in the CEO and subject-matter experts when necessary.
Own pilots end to end: scoping, SOW, staffing, production, QC, and delivery, ensuring nothing ships until it is lab-ready and any potential issues have already been identified.