Our team builds the benchmarks, environments, and tooling that power model and agent refinement, and turns observations into actionable opportunities for the next model and agent iteration. We work across the full spectrum of evaluation: offline benchmarks, device-in-the-loop simulation, and on-device observation in production. We develop LLM-as-judge evaluators, train reward models calibrated against human feedback, optimize prompts and context for agents, and contribute targeted datasets and reward signals to foundation model post-training.

In this role, you will be central to designing and developing the evaluation and refinement infrastructure that supports a broad range of AI products at Apple. You will work on agent and model evaluation across offline, device-in-the-loop, and on-device settings; build automated prompt and context optimization pipelines; and partner with product and research teams to translate failure analysis into measurable model and agent improvements. You will also engage with product teams across Apple and contribute to advancements in large language models and agentic systems that will reach millions of users.

To succeed in this role, you should have a strong background in machine learning systems and distributed infrastructure, along with a proven track record of building and maintaining ML evaluation or training infrastructure. You should be a proactive problem solver with excellent communication skills and the ability to work effectively across multiple codebases, teams, and organizations. Experience with LLM evaluation, reward modeling, prompt optimization, or agentic systems is highly desirable.
Job Type
Full-time
Career Level
Senior
Education Level
Associate degree