About The Position

Join Apple Services Engineering to build the next generation of AI evaluation systems. This role seeks a staff machine learning platform engineer to lead the architectural design and development of high-availability services and internal tools that power self-service evaluation at scale. The engineer will partner closely with researchers to operationalize their innovations, transforming sophisticated measurement techniques and complex workflows into intuitive, developer-first platforms: APIs, SDKs, and platform services that turn complex evaluation metrics into simple, self-service calls and scale reliably within high-availability infrastructure. The role also involves driving engineering standards for a new organization, upholding the code quality, automation, and testing rigor required to support the rapid evolution of Generative AI and Agentic systems. The position is for builders who thrive in the ambiguity of new initiatives and are passionate about creating scalable infrastructure.

Requirements

  • 8+ years of hands-on software engineering experience, with a track record of owning the technical direction of a platform or infrastructure domain.
  • Strong proficiency in the Python ecosystem (e.g., FastAPI, Pydantic, Pandas).
  • Ability to write production-grade code and lead architectural discussions.
  • Customer Obsession & Product Thinking: experience owning the technical roadmap for an internal platform, presenting it to senior stakeholders, and shipping against it.
  • Ability to independently translate vague requirements from other teams into concrete engineering specifications and platform roadmaps.
  • Demonstrated experience leading technical partnerships with Data Scientists or Researchers, including taking research code and shipping it as a production service, along with the abstractions, testing frameworks, and deployment pipelines that make it maintainable.
  • Strong expertise in API Design & Platform Infrastructure: experience designing and owning APIs and SDKs that other developers rely on, with a focus on versioning, backward compatibility, and developer experience at scale (a minimal sketch of such a service follows this list).
  • Operational excellence background: experience architecting and owning CI/CD pipelines, containerization (Docker/Kubernetes), and monitoring (Datadog/Prometheus) for production services, with accountability for their reliability.
  • Bachelor's degree in Computer Science or a related field.
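
For a flavor of the stack and API ownership described above, here is a minimal sketch of a self-service evaluation endpoint built with FastAPI and Pydantic, the libraries named in the requirements. The route, the request shape, and the exact_match metric are hypothetical, chosen purely for illustration.

    from fastapi import FastAPI, HTTPException
    from pydantic import BaseModel, Field

    app = FastAPI(title="evaluation-service")

    class EvalRequest(BaseModel):
        metric: str = Field(..., description="Name of a registered metric")
        prediction: str
        reference: str

    class EvalResult(BaseModel):
        metric: str
        score: float

    # Registry of scoring callables; a real platform would load these from
    # researcher-contributed metric packages rather than hard-coding them.
    METRICS = {
        "exact_match": lambda pred, ref: float(pred.strip() == ref.strip()),
    }

    # Versioned route: /v1/ is one small nod to the backward-compatibility
    # concern called out in the requirements.
    @app.post("/v1/evaluate", response_model=EvalResult)
    def evaluate(req: EvalRequest) -> EvalResult:
        scorer = METRICS.get(req.metric)
        if scorer is None:
            raise HTTPException(status_code=404, detail=f"unknown metric {req.metric!r}")
        return EvalResult(metric=req.metric, score=scorer(req.prediction, req.reference))

The pattern, not the specifics, is the point: researchers contribute metrics to the registry, and every consumer gets the same one-call interface behind a versioned route.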

Nice To Haves

  • Deep familiarity with AI Evaluation Frameworks (e.g., DeepEval, Ragas, TruLens, or LangSmith) and an understanding of how to implement and scale model-based evaluation workflows across a large organization.
  • Experience owning the deployment, scaling, and operational health of evaluation services in production, including high-throughput evaluation job orchestration (queueing, prioritization, concurrency, auto-scaling) and defining SLAs for evaluation pipeline latency and availability (a concurrency-and-backoff sketch follows this list).
  • Experience instrumenting production ML evaluation pipelines, including tracking evaluation job throughput, queue depth, judge-model latency SLAs, scoring drift over time, and failure modes specific to non-deterministic LLM-based evaluation workflows (see the instrumentation sketch after this list).
  • Deep understanding of Generative AI & Agents, including the engineering challenges of relying on LLMs and Agents as software components (managing token economics, handling rate limits, evaluating non-deterministic, multi-step reasoning), with experience building production systems that depend on these components and solving those problems at scale.
  • Experience thriving in startup-like environments, navigating high ambiguity to deliver complex technical roadmaps from scratch.
  • Master's degree.
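
As one possible shape for the job orchestration described above, here is a minimal asyncio sketch of bounded-concurrency scoring with jittered exponential backoff on rate limits. The judge() stub, the RateLimitError type, and all constants are invented stand-ins for a real judge-model client.

    import asyncio
    import random

    class RateLimitError(Exception):
        """Stand-in for the 429-style errors a judge-model API can return."""

    async def judge(sample: str) -> float:
        # Hypothetical judge-model call: intermittent rate limits, some latency.
        if random.random() < 0.2:
            raise RateLimitError
        await asyncio.sleep(0.05)
        return random.random()

    async def score_with_backoff(sample: str, sem: asyncio.Semaphore,
                                 max_retries: int = 5) -> float:
        for attempt in range(max_retries):
            try:
                # The semaphore caps in-flight judge calls; backoff happens
                # outside it so a sleeping task does not hold a slot.
                async with sem:
                    return await judge(sample)
            except RateLimitError:
                await asyncio.sleep((2 ** attempt) * 0.1 + random.random() * 0.1)
        raise RuntimeError(f"gave up on {sample!r} after {max_retries} retries")

    async def run_batch(samples: list[str]) -> list[float]:
        sem = asyncio.Semaphore(8)  # concurrency budget for the judge model
        return await asyncio.gather(*(score_with_backoff(s, sem) for s in samples))

    scores = asyncio.run(run_batch([f"sample-{i}" for i in range(32)]))
    print(f"scored {len(scores)} samples")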
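
For the instrumentation bullet, a minimal sketch using the prometheus_client library (Prometheus being one of the monitoring tools named in the requirements). The metric names and the simulated work are hypothetical; the point is which signals an evaluation pipeline tracks.

    import random
    import time
    from prometheus_client import Counter, Gauge, Histogram, start_http_server

    JOBS_COMPLETED = Counter("eval_jobs_completed_total", "Evaluation jobs finished")
    JOB_FAILURES = Counter("eval_job_failures_total", "Evaluation jobs failed", ["reason"])
    QUEUE_DEPTH = Gauge("eval_queue_depth", "Evaluation jobs waiting to run")
    JUDGE_LATENCY = Histogram("eval_judge_latency_seconds", "Judge-model call latency")

    def run_job(queue: list[str]) -> None:
        QUEUE_DEPTH.set(len(queue))
        queue.pop()
        start = time.monotonic()
        try:
            time.sleep(random.uniform(0.01, 0.1))  # stand-in for a judge-model call
            JOBS_COMPLETED.inc()
        except TimeoutError:
            JOB_FAILURES.labels(reason="timeout").inc()
        finally:
            JUDGE_LATENCY.observe(time.monotonic() - start)

    if __name__ == "__main__":
        start_http_server(9100)  # expose /metrics for Prometheus to scrape
        queue = [f"job-{i}" for i in range(100)]
        while queue:
            run_job(queue)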

Responsibilities

  • Lead the architectural design and development of high-availability services and internal tools powering self-service evaluation at scale.
  • Partner with researchers to operationalize their innovations, transforming complex workflows into intuitive, developer-first platforms.
  • Elevate the developer experience by architecting and implementing APIs, SDKs, and platform services that turn complex evaluation metrics into simple, self-service calls (illustrated from the consumer side in the sketch below).
  • Work hand-in-hand with researchers to operationalize sophisticated measurement techniques, ensuring they scale reliably within high-availability infrastructure.
  • Drive the engineering standards for a new organization, upholding the code quality, automation, and testing rigor required to support the rapid evolution of Generative AI and Agentic systems.
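
Finally, a sketch of what "simple, self-service calls" can look like from the consumer side: a thin SDK method over the evaluation service sketched earlier. The EvalClient class, the internal base URL, and the faithfulness metric are all hypothetical.

    import requests

    class EvalClient:
        """Hypothetical thin SDK over the evaluation service sketched earlier."""

        def __init__(self, base_url: str = "http://eval.internal/v1"):
            self.base_url = base_url.rstrip("/")

        def evaluate(self, metric: str, prediction: str, reference: str) -> float:
            resp = requests.post(
                f"{self.base_url}/evaluate",
                json={"metric": metric, "prediction": prediction, "reference": reference},
                timeout=30,
            )
            resp.raise_for_status()
            return resp.json()["score"]

    # One call replaces a researcher's bespoke evaluation workflow.
    client = EvalClient()
    score = client.evaluate("faithfulness", prediction="The sky is blue.",
                            reference="The sky is blue.")

Pinning the version in the base URL lets the SDK stay stable while the service evolves, which is the backward-compatibility discipline the requirements call for.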