AI systems are only as trustworthy as the methods used to evaluate them. At Apple, where AI powers experiences for billions of people, getting evaluation right is not a support function. It is a foundational science. As these systems grow in complexity, the quality of our products is increasingly constrained by the quality of our evaluation methods. Our team is building the scientific foundation and self-service tools for how AI evaluation is done at scale, spanning LLMs, agentic systems, and human-AI interaction. We don’t just publish methods; we productionize them.
We are looking for a Sr. Research Manager to lead an ML research team that advances the state of the art in evaluation methods that can be shipped as production tools for Apple developers and published in top venues. The team works in close collaboration with applied scientists and measurement scientists to build evaluation methodology and systems that are human-centered, psychometrically rigorous, and at the technical frontier.
You will set the research agenda, direct the team's portfolio across near-term and long-term bets, and ensure that novel methods are designed from the outset for productionization into evaluation SDKs and APIs. The team has active projects across multiple research areas; your most immediate contribution will be bringing strategic focus to this portfolio, leading a research lifecycle that turns your team’s work into high-impact internal applications, and positioning work for external impact at top-tier venues. You will have a strong ML background and a track record of leading research teams that publish at venues like NeurIPS, ICML, and ICLR while simultaneously shipping methods into production tools.
What makes this team unusual is its interdisciplinary core. You will lead ML researchers working alongside measurement scientists and applied scientists, bringing together frontier ML research, psychometric rigor, and production engineering. What unites the strongest candidates is depth of thinking about evaluation as a research problem and the conviction that how we measure AI systems is as important as how we build them.
Job Type
Full-time
Career Level
Senior
Education Level
Ph.D. or professional degree