Systems Engineer - Evaluation Engineering

Apple • Cupertino, CA

About The Position

We are looking for a Distributed Systems Engineer to own the infrastructure powering our core Siri Agentic Evaluation Platform. Evaluation is no longer just a static test suite; it is a highly dynamic, massive-scale distributed problem. Our platform enables teams to run high-throughput agentic simulations, orchestrate multi-model judging pipelines, and generate real-time observability dashboards across billions of tokens and complex data types. In this role, you will design the execution engine that coordinates these complex evaluation loops. You will build systems that remain deterministic, fault-tolerant, and cost-efficient, even when coordinating massive parallel requests across heterogeneous device types (iPhone, Mac, iPad, and others).

Requirements

MS in Computer Science or equivalent.
7+ years of experience as a distributed systems engineer, platform engineer, or equivalent.
Strong proficiency in languages optimized for concurrency and enterprise scale, such as Python (asyncio) or Java.
Deep expertise in designing robust, versioned production APIs using gRPC/Protobuf, GraphQL, or REST (FastAPI).
Strong experience modeling complex relational data and trace hierarchies using PostgreSQL, combined with high-throughput analytical query layers.
Experience designing asynchronous, event-driven architectures using Kafka, AWS SQS/SNS, RabbitMQ, or Redis Streams.
Advanced experience with Kubernetes (orchestration, custom operators, service meshes like Istio or Linkerd) and cloud providers (AWS, GCP, or Azure).
Proficiency with Terraform to manage infrastructure declaratively.
Experience building automated, containerized deployment pipelines (GitHub or ArgoCD) with an emphasis on keeping developer feedback loops fast and reliable.

Nice To Haves

Experience building Agentic RAG platforms or developer-facing infrastructure tooling.

Responsibilities

Architect and scale the core asynchronous engine responsible for orchestrating thousands of parallel agent simulations, validation tests, and LLM-as-a-judge pipelines.
Design and build self-service infrastructure, CLI tools, and internal APIs that allow ML and product teams to easily integrate evaluation pipelines into their CI/CD workflows.
Design, build, and maintain highly performant, type-safe APIs (gRPC/REST) capable of serving complex evaluation pipelines, trace data, and real-time generation metrics.
Build robust data pipelines to ingest and transform high-volume execution traces. Ensure immutable data lineage so that every evaluation metric can be traced back to its raw generation for granular error attribution.
Own the deployment topologies of the evaluation platform across multi-tenant clusters using declarative infrastructure and continuous delivery practices.
Implement deep observability (distributed tracing, structured metrics, and alerting) across the platform. Design smart scheduling layers, token buckets, and circuit breakers to prevent downstream API rate-limiting or cascading cluster failures.
