Evaluation Reliability SRE

Apple, Cupertino, CA

About The Position

Siri’s quality signal drives every model and product decision before a release ships. But a signal is only as trustworthy as the infrastructure behind it. The Evaluation Reliability Engineering (ERE) team exists to make that infrastructure bulletproof. Within ERE, Core SRE owns the production backbone: resource management, session orchestration, on-call response, and the observability systems that surface failures before they corrupt evaluation signal. We sit at the intersection of distributed systems, ML evaluation infrastructure, and operational excellence.

Requirements

  • 5+ years of site reliability, infrastructure, or platform engineering experience with direct on-call ownership in production systems.
  • Hands-on orchestration experience (Kubernetes or equivalent): cluster health, resource management, scheduling, and failure diagnosis at scale.

Nice To Haves

  • Experience owning or closely operating a device or VM provisioning pipeline; familiarity with virtualization-layer failure modes is a strong plus.
  • Track record of improving system reliability against measurable outcomes — uptime, MTTR, incident frequency — not just responding to incidents but eliminating their causes.
  • Incident command discipline: able to lead a multi-team incident from declaration to close-out.
  • Depth in at least one of: distributed systems reliability, device management infrastructure, evaluation or ML platform operations.
  • Demonstrated cross-team technical influence; prior experience shaping reliability practices beyond the immediate team.

Responsibilities

  • Share primary on-call as part of a global follow-the-sun rotation.
  • Lead incident investigations end-to-end.
  • Set the operational bar the rest of the team works against.
  • Use agentic coding tools like Claude Code, Cursor, or Copilot as a force multiplier across runbook authoring, automation, and log analysis.