ASE Compute - Site Reliability Engineering (SRE) Manager

Apple • Seattle, WA

About The Position

Apple's ASE Compute team builds and operates the private cloud infrastructure that powers Apple services at massive scale. Our platform delivers bare-metal Kubernetes clusters and virtualized environments to thousands of engineers across the company. We are looking for an SRE Manager to lead a team that keeps this infrastructure reliable, performant, and ready for the next order of magnitude of growth.

This is a hands-on leadership role. You will set the technical direction for reliability and operational excellence while mentoring engineers, driving automation, and partnering closely with software and infrastructure teams to ship improvements that matter.

You will have a direct impact on the platform that underpins Apple's most critical services. You will work alongside world-class engineers solving problems at a scale few organizations encounter, and you will build a team culture that makes reliability engineering sustainable and rewarding.

Requirements

3+ years of engineering management experience leading infrastructure or SRE teams
Deep experience operating large-scale, multi-tenant Kubernetes environments in production
Strong systems background — comfortable troubleshooting across the full stack (network, OS, container runtime, application)
Experience with configuration management at scale (Puppet, Ansible, or equivalent)
Track record of building high-performing teams through coaching, clear expectations, and psychological safety
Demonstrated ability to drive cross-functional initiatives to completion
Strong written and verbal communication skills

Nice To Haves

Experience with third-party cloud platforms (AWS, GCP, or Azure)
Familiarity with bare-metal provisioning and lifecycle management at datacenter scale
Experience with Java, Go, or Python services in production
Understanding of cloud-native observability (Prometheus, Thanos, Splunk, or similar)
CNCF Certified Kubernetes Administrator (CKA) or equivalent hands-on certification
Experience running infrastructure as an internal managed service with defined SLAs

Responsibilities

Lead, grow, and mentor a team of Site Reliability Engineers focused on large-scale Kubernetes and compute infrastructure
Own the reliability, availability, and performance of mission-critical cloud platform services
Drive incident response, post-incident reviews, and systemic improvements that reduce operational toil
Partner with software engineering and architecture teams to influence system design for reliability, scalability, and operability
Establish and refine SRE practices including SLOs, error budgets, capacity planning, and change management
Champion automation — eliminate manual processes through tooling and self-service capabilities
Manage on-call rotations and ensure sustainable, well-supported operational coverage
Communicate clearly across teams to build a culture of visibility, transparency, and shared ownership
