Manager, AI Quality Evals

Glean • Mountain View, CA

Hybrid

About The Position

Glean is seeking a Manager, AI Quality Evals to build and lead a team responsible for the human evaluation system behind Glean’s AI product quality. This role will oversee the operational aspects of AI output evaluation, including labeling operations, quality feedback analysis and triage, evalset quality, rubric design, and recurring quality reporting. This is a high-visibility, cross-functional leadership position at the intersection of product quality, model evaluation, and release readiness. The successful candidate will collaborate with Engineering, Product, QA, and the eval infrastructure team to ensure that all datasets, judges, and rubrics are realistic and validated, and that they contribute to measurable quality improvements.

Requirements

7+ years of experience in evaluation operations, data labeling, quality operations, ML data operations, technical program management, or a related function leading ambiguous, cross-functional work.
Experience leading or mentoring high-performing ICs in a small, fast-moving team environment.
Deep familiarity with AI product evaluation, human labeling systems, benchmarking, rubric design, and quality measurement.
Experience with LLM-as-a-judge systems, calibration, or model-eval tooling.
Experience supporting quality intelligence or benchmark programs for AI products.
Strong analytical judgment; ability to separate signal from noise, identify failure patterns, and turn ambiguous quality issues into clear actions.
Comfortable operating at the intersection of Product, Engineering, Data Science, and infrastructure teams, and able to lead through influence rather than authority.
Strong communication and operating rigor; ability to run repeatable cadences, align stakeholders, and present crisp quality insights to leadership.
High bar for realism and quality; focus on measurable lift in product quality and ROI.
Experience with vendor-managed data workflows, dataset acceptance, or data quality gates.

Responsibilities

Lead Glean’s end-to-end operational evaluation across human labeling, benchmarking, and product-quality analysis.
Build and manage a high-leverage team focused on answer quality, evals, and recurring labeling operations.
Establish operating rhythms for weekly triage and labeling, eval health reviews, and quality readouts.
Maintain repeatable benchmarks and frontier-model baselines.
Evaluate performance across priority workflows and quality dimensions such as task success, response quality, MCP and tool use, actions reliability, and end-to-end Cowork workflow execution.
Partner with the eval team to curate new evalsets, generate labeled datasets, support model training needs, and improve LLM-as-a-judge raters.
Own labeling guidelines, human calibration, and the ongoing effort to debias LLM judges against human labels.