ML Engineer - Agentic Systems Evaluation

Apple • Cupertino, CA


About The Position

We are looking for a high-impact ML Evaluation Engineer to help architect rigorous evaluation systems for autonomous agents. With the rise of generative AI, the ability to quantify the reliability and quality of these systems is more critical than ever. You will design and deploy qualitative and quantitative metrics to measure the quality, reasoning, and tool-use accuracy of agentic systems. You will be working with highly sensitive data, so applying existing privacy-enhancing technologies and developing new ones (such as differential privacy, PII redaction, and data minimization) will be crucial. The team you will be joining is focused on advancing scalable automated processes for evaluation. To succeed, you will need a deep understanding of system-level software operations to deliver next-generation capabilities. Join the Proactive Intelligence team to build the evaluation platforms for the future of intelligent, personalized experiences.

Requirements

MS or PhD in Computer Science, Machine Learning, Statistics, or equivalent practical experience in a quantitative field
3+ years of industry experience in ML Engineering or Applied Science
Strong software engineering fundamentals (Python is a must) with experience building scalable, automated data or evaluation pipelines

Nice To Haves

Demonstrated experience applying Differential Privacy, Federated Learning, or advanced PII redaction techniques to large-scale datasets
Hands-on experience building or testing LLM-based systems, including a deep understanding of chain-of-thought reasoning, prompt engineering, and agentic planning
Proficiency in building or evaluating systems that integrate with external tools/APIs
Experience with specialized agent evaluation frameworks and analyzing execution traces to identify failure modes in multi-turn interactions
Experience with compiled languages (e.g., Swift) and a curiosity about how ML interacts with OS-level software operations
A track record of developing custom metrics (e.g., "LLM-as-a-Judge") or publishing research on model reliability, safety, or algorithmic bias
