The ML Infrastructure team is responsible for Apple’s largest ML compute platform and for a multi-cloud storage abstraction and caching platform, supporting critical machine learning training workloads that power user-facing features across the Apple ecosystem. Operating across both first-party and third-party cloud environments brings complex and unique challenges. As a Site Reliability Engineer (SRE) on the ML Infrastructure team, you’ll be expected to address these challenges through a strong foundation in cloud object storage, data analysis, automation, and collaboration, along with advanced expertise in Kubernetes. Our team oversees the full infrastructure stack, from low-level nodes to the complete network architecture, ensuring our platform remains highly available, resilient, and efficient at scale. We are seeking an experienced Software and Systems Engineer to join our dynamic team. This role demands a proactive mindset, technical excellence, and a collaborative spirit.
Job Type
Full-time
Career Level
Senior
Education Level
No Education Listed