Human Evaluation - Program Manager

Netflix · New York, NY

About The Position

Netflix is building toward more intelligent and responsive systems, and thoughtful, high-quality evaluation is essential to ensuring progress in the right direction. You will join a team that creates the frameworks, tools, and workflows that ensure human judgment is applied with consistency, clarity, and care across evaluation criteria such as helpfulness, tone, safety, relevance, and creative quality.

In this role, you will shape how human and AI-driven evaluations are designed and own their day-to-day execution, from scoping and planning to rater onboarding and calibration. You will also act as a thought partner and influencer: aligning stakeholders, introducing new ways of working, and establishing a shared language around quality. Your work will ensure AI features are high-performing and aligned with Netflix's values, users, and brand integrity. The role sits within a small team focused on rigorous, well-aligned, and effectively resourced evaluation designs executed at scale.

Requirements

  • 4+ years of experience working in human evaluations, data collection, labeling, or annotation operations in GenAI/ML environments
  • Track record of implementing process improvements or quality control systems for data collection needs
  • Prior experience managing human annotation vendors, raters, or data labeling teams
  • Strong understanding of evaluation design, including guidelines, rubrics, and scoring protocols
  • Proven ability to manage complex, cross-functional programs end to end, with strong program management skills and clear accountability for successful delivery
  • Experience with human labeling platforms
  • Excellent written and verbal communication skills
  • Ability to synthesize feedback into clear recommendations and process improvements
  • Familiarity with responsible AI principles and how to embed them into evaluation design
  • Strong organizational skills and executional focus; ability to track details while seeing the bigger picture

Responsibilities

  • Lead end-to-end execution of human evaluation and data operations initiatives—from intake and scoping to delivery
  • Develop and operationalize frameworks for evaluating GenAI and ML outputs
  • Collaborate across research, product, UX, and engineering to embed evaluation into model development cycles
  • Build and maintain project timelines, proactively manage blockers, and ensure timely execution
  • Develop clear, scalable guidelines and scoring rubrics to ensure consistent rater judgment
  • Oversee rater onboarding, calibration, and QA workflows
  • Define and monitor success metrics such as speed to inter-rater reliability (IRR), throughput, and task effectiveness
  • Pilot and refine evaluation tasks to improve clarity, inter-rater reliability, and feedback quality
  • Build foundational documentation and drive adoption of best practices across teams
  • Track evaluation health and communicate progress to stakeholders clearly and proactively
  • Anticipate and resolve bottlenecks and blockers before they affect delivery
  • Act as the connective tissue across multiple partners to ensure alignment and effective execution of evaluations at scale

Benefits

  • Health Plans
  • Mental Health support
  • 401(k) Retirement Plan with employer match
  • Stock Option Program
  • Disability Programs
  • Health Savings and Flexible Spending Accounts
  • Family-forming benefits
  • Life and Serious Injury Benefits
  • Paid leave of absence programs
  • Flexible time off

What This Job Offers

Job Type

Full-time

Career Level

Manager

Education Level

No Education Listed

Number of Employees

5,001-10,000 employees