ML Evaluation Specialist, Human Data

Apple • Cupertino, CA

About The Position

At Apple, we don’t just build products; we build experiences fueled by world-class data. The Human-centered AI team within Apple Services Engineering is looking for an ML Evaluation Specialist, Human Data, to join our Data Quality and Operations division to spearhead complex, multi-stakeholder operations that specialize in data collection, curation, annotation, and human evaluation efforts across Apple Music, App Store, TV+, Podcasts, and Books.

In this role, you will own the operational strategy and continuous improvement of large-scale, multilingual human data programs, from designing onboarding scaffolds that progressively build annotator calibration, to analyzing annotator behavior patterns to identify where automation can offload low-judgment decisions, to enforcing quality frameworks that close the loop between annotator struggle and task redesign. You will identify where human judgment is essential and where it could be better directed, then build the scaffolding, automation, and feedback systems that let annotators focus their cognitive energy where it matters most.

Because this work cuts across engineering, data science, research, procurement, and legal, a critical part of the role is serving as the connective tissue between teams who each own a piece of this space, aligning on shared standards, surfacing gaps, and ensuring that insights from the annotation layer inform upstream decisions about task design and tooling. You will bring a point of view on human data best practices and translate it into scalable, human-centered approaches that make generative AI features safer and more reliable.

The ideal candidate brings a rare combination of technical depth and program execution skills. You are comfortable designing and deploying sophisticated data pipelines in the morning, and then seamlessly transitioning to present comprehensive quality rectification strategies to stakeholders in the afternoon. You care deeply about data quality and human alignment, have a creative and systematic approach to finding and fixing problems, and find motivation in wide-ranging work whose impact shows up in everyday Apple experiences.

Requirements

Bachelor's degree or higher in Cognitive Science, Linguistics, or a related field that includes an experimental or empirical component
4+ years of experience defining and leading cross-team human data programs for AI/ML, including annotation operations, quality frameworks, and evaluation strategies, within an NLP/NLU or generative AI environment
Proficiency in programming and data languages (Python, R, SQL) to process, analyze, and query large datasets, extract insights, automate tasks, and monitor program performance
Hands-on experience designing and managing 0→1 human-in-the-loop data collection, annotation, and evaluation initiatives, including driving and incorporating agentic workflows to improve quality and scalability
Experience working with diverse data types (e.g., speech, text, multimodal) across multiple languages
Expertise in end-to-end data annotation quality management, including the ability to develop statistical process controls and data quality metrics
Familiarity with privacy-preserving data handling practices and compliance frameworks
Demonstrated success optimizing data pipelines and workflows to improve quality, reduce lead time, and scale operations
Experience working cross-functionally with engineering, data science, legal, privacy, and third-party suppliers

Nice To Haves

Master's degree or higher in Cognitive Science, Linguistics, or a related field that includes an experimental or empirical component
2+ years of experience owning data strategy for frontier AI development and evaluation, with experience in human alignment methodologies and agentic GenAI systems
Experience managing external vendor or workforce partners at scale
Familiarity with AI Safety and Responsible AI principles, including experience applying them to data collection or annotation workflows
Strong organizational skills and an execution-oriented mindset; ability to balance attention to detail with big-picture thinking in an environment where program scope and priorities evolve quickly
Excellent written and verbal communication skills; able to translate technical concepts for non-technical stakeholders

Responsibilities

Own the operational strategy and continuous improvement of large-scale, multilingual human data programs.
Design onboarding scaffolds that progressively build annotator calibration.
Analyze annotator behavior patterns to identify where automation can offload low-judgment decisions.
Enforce quality frameworks that close the loop between annotator struggle and task redesign.
Identify where human judgment is essential and where it could be better directed.
Build scaffolding, automation, and feedback systems that let annotators focus their cognitive energy where it matters most.
Serve as the connective tissue between teams (engineering, data science, research, procurement, legal) to align on shared standards, surface gaps, and ensure insights from the annotation layer inform upstream decisions about task design and tooling.
Bring a point of view on human data best practices and translate it into scalable, human-centered approaches that make generative AI features safer and more reliable.
Design and deploy sophisticated data pipelines.
Present comprehensive quality rectification strategies to stakeholders.