About The Position

This role is for a Machine Learning expert to drive the operational setup, execution, and quality assurance of safety evaluations across languages and markets. You will collaborate with subject matter experts and cross-functional partners to develop canonical evaluation guidelines, configure evaluation tasks, run pilots, monitor live evaluations, and ensure data quality throughout the evaluation lifecycle. You will also design evaluations that scale across diverse linguistic contexts: building on product safety requirements to create taxonomies, composing and curating exemplar safety evaluation datasets, and ensuring that evaluation frameworks are culturally and linguistically grounded. An ideal candidate has strong data science fundamentals; experience managing complex annotation or evaluation tasks; a solid understanding of sociotechnical evaluation design principles and practices; and experience designing evaluations, classification systems, and annotation or study-participant guidelines that support policies and product requirements.

Requirements

  • 3+ years of experience in a data science, applied research, or evaluation operations role, with hands-on experience managing annotation or evaluation pipelines.
  • Proficiency in Python and experience with data processing, statistical analysis, and visualization libraries (e.g., pandas, NumPy, scipy, matplotlib, seaborn).
  • Experience developing and maintaining annotation guidelines or evaluation protocols for human labeling tasks.
  • Comfortable computing and interpreting inter-rater reliability metrics (e.g., Cohen's kappa, Krippendorff's alpha) and other data quality indicators.
  • Demonstrated ability to collaborate with annotation operations services, vendor teams, or distributed study participants.
  • Able to work independently as well as collaboratively with minimal direction.
  • Organized, highly attentive to detail, and skilled at managing time.
  • 1+ year of experience working in industry.
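
As a concrete illustration of the inter-rater reliability requirement above, Cohen's kappa for two annotators can be computed from scratch with the standard library (the labels below are hypothetical, not taken from any real evaluation):

```python
# Minimal sketch: Cohen's kappa for two raters over the same items.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labeled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected (chance) agreement from each rater's marginal label distribution.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / n**2
    return (p_o - p_e) / (1 - p_e)

rater_a = ["safe", "unsafe", "safe", "safe", "unsafe", "safe"]
rater_b = ["safe", "unsafe", "unsafe", "safe", "unsafe", "safe"]
print(round(cohens_kappa(rater_a, rater_b), 2))  # → 0.67
```

In practice, libraries such as scikit-learn provide an equivalent `cohen_kappa_score`, and Krippendorff's alpha is preferred when there are more than two raters or missing labels.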

Nice To Haves

  • Advanced degree (MS/PhD) in Data Science, Statistics, Computational Linguistics, Information Science, or a related field.
  • Experience operating evaluation or annotation pipelines across multiple languages or markets.
  • Familiarity with annotation platforms and task management tools (e.g., Label Studio, Scale AI, or similar).
  • Experience with SQL and large-scale data infrastructure (e.g., Spark, Hadoop, or cloud-based analytics platforms).
  • Prior experience in AI safety, responsible AI, content moderation, or trust and safety domains.
  • Experience designing quality assurance frameworks for crowdsourced or distributed annotation work.
  • General familiarity with localization workflows or working with language service providers.
© 2024 Teal Labs, Inc