About The Position

The ML Infrastructure team manages Apple’s largest ML compute platform and a multi-cloud storage abstraction and caching platform, which support critical machine learning training workloads that power user-facing features across the Apple ecosystem. Operating across both first-party and third-party cloud environments brings complex and unique challenges. As a Site Reliability Engineer (SRE) on the ML Infrastructure team, you’ll be expected to address these challenges through a strong foundation in cloud object storage, data analysis, automation, and collaboration, along with advanced expertise in Kubernetes. Our team oversees the full infrastructure stack, from low-level nodes to the complete network architecture, ensuring our platform remains highly available, resilient, and efficient at scale. We are seeking an experienced Software and Systems Engineer to join our team. This role demands a proactive mindset, technical excellence, and a collaborative spirit.

Requirements

  • 5+ years of experience building, operating, and scaling large applications in private, public, or hybrid cloud environments.
  • Deep expertise in Kubernetes, with hands-on experience using platforms such as Google Kubernetes Engine (GKE) or Amazon Elastic Kubernetes Service (EKS).
  • Proficient in designing, developing, and releasing code in languages such as Python, Go, or Rust.
  • Practical experience with object storage technologies, including Amazon S3 or Google Cloud Storage (GCS).
  • Strong background in designing and troubleshooting complex networking issues in both public and private cloud infrastructures.
  • Solid understanding of Linux internals, standard networking protocols, and distributed systems architecture.
  • Proven drive to automate manual operations and enhance processes through continuous iteration.
  • Strong understanding of best practices for deploying large-scale, distributed applications.
  • Hands-on experience managing diverse system environments using configuration management tools or software delivery platforms such as Spinnaker, Helm, or Flux.
  • Demonstrated expertise in deploying, supporting, and monitoring both new and existing services, platforms, and application stacks.
  • Solid familiarity with container orchestration and management using Kubernetes.

Nice To Haves

  • Strong critical thinking and a high degree of individual accountability.
  • Effective communication and collaboration skills.
  • A genuine passion for Infrastructure as a Service (IaaS).
  • A commitment to automation and operational efficiency.
  • Ownership of projects from design through delivery.
  • A solutions-oriented approach, coupled with the ability to gain alignment on technical direction.
  • Consistent and timely execution of design implementations aligned with project objectives.
  • The ability to provide constructive technical feedback, fostering team-wide growth and continuous improvement.

Responsibilities

  • Oversee the full infrastructure stack, from low-level nodes to the complete network architecture, ensuring our platform remains highly available, resilient, and efficient at scale.
  • Address challenges through a strong foundation in cloud object storage, data analysis, automation, collaboration, and advanced expertise in Kubernetes.
  • Participate in a rotating on-call schedule, including occasional weekend coverage when necessary.
  • Deploy, support, and monitor both new and existing services, platforms, and application stacks.
  • Manage diverse system environments using configuration management tools or software delivery platforms such as Spinnaker, Helm, or Flux.