Manager, AI Quality Evals

Glean • Mountain View, CA

2d • $195,000 - $240,000 • Hybrid

About The Position

We are hiring a Manager, AI Quality Evals to build and lead a small, senior team responsible for the human evaluation system behind Glean’s AI product quality. This role will own the operational backbone for AI output evaluation: labeling operations, quality feedback analysis and triage, evalset quality, rubric design, and recurring quality reporting. This is a high-visibility, cross-functional leadership role at the center of product quality, model evaluation, and release readiness. You will partner closely with Engineering, Product, QA, and the eval infrastructure team to ensure every dataset, judge, and rubric we rely on is realistic, validated, and tied to measurable quality lift.

Requirements

7+ years of experience in evaluation operations, data labeling, quality operations, ML data operations, technical program management, or a related function, including leading ambiguous, cross-functional work.
Experience leading or mentoring high-performing ICs in a small, fast-moving team environment.
Deep familiarity with AI product evaluation, human labeling systems, benchmarking, rubric design, and quality measurement.
Experience with LLM-as-a-judge systems, calibration, or model-eval tooling.
Experience supporting quality intelligence or benchmark programs for AI products.
Strong analytical judgment; you can separate signal from noise, identify failure patterns, and turn ambiguous quality issues into clear actions.
Comfortable operating at the intersection of Product, Engineering, Data Science, and infrastructure teams, and able to lead through influence rather than authority.
Strong communication and operating rigor; you can run repeatable cadences, align stakeholders, and present crisp quality insights to leadership.
High bar for realism and quality; you care not just about eval volume, but about measurable lift in product quality and ROI.
Experience with vendor-managed data workflows, dataset acceptance, or data quality gates.

Responsibilities

Lead Glean’s end-to-end operational evaluation motion across human labeling, benchmarking, and product-quality analysis.
Build and manage a high-leverage team focused on answer quality, evals, and recurring labeling operations.
Establish operating rhythms for weekly triage and labeling, eval health reviews, and quality readouts.
Maintain repeatable benchmarks and frontier-model baselines.
Evaluate performance across priority workflows and quality dimensions such as task success, response quality, MCP and tool use, actions reliability, and end-to-end Cowork workflow execution.
Partner with the eval team to curate new evalsets, generate labeled datasets, support model training needs, and improve LLM-as-a-judge raters.
Own labeling guidelines, human calibration, and the ongoing effort to debias LLM judges against human labels.