About The Position

Apple's Artificial Intelligence and Data Platforms (AiDP) team is seeking an experienced Site Reliability Engineering (SRE) Manager to support scalable and resilient distributed systems that power Apple's data pipelines and analytics platforms. Our Enterprise Data Warehouse landscape caters to a wide variety of real-time, near real-time and batch analytical solutions. These solutions are an integral part of business functions like Sales, Operations, Finance, AppleCare, Marketing and Internet Services, enabling business drivers to make critical decisions. We use proprietary and open source technologies such as Kafka, Spark, Iceberg, and Airflow to build these solutions.

If you are passionate about addressing infrastructure challenges at scale, both on-premises and in the cloud, and focused on optimizing scalable solutions by prioritizing ease of use and maintenance, you will discover exciting opportunities in AiDP. As a hands-on SRE Manager, you’ll lead by example: actively driving operational excellence, contributing to code, and ensuring system reliability. You will be deeply involved in incident response across complex, distributed data platforms designed to support data exploration, analytics, and reporting solutions. These platforms operate at the unique intersection of high data volume and hybrid infrastructure, spanning both cloud and on-premises environments. We are looking for a collaborative and innovative leader who thrives under tight deadlines, excels at solving complex problems, and consistently delivers high-quality, forward-thinking solutions.

Requirements

  • Bachelor’s degree or equivalent, with 10+ years of experience in the SRE domain and at least 3 years in a management role focused on leading, hiring, developing and building teams
  • Hands-on experience building, supporting, and maintaining applications and large-scale distributed systems in cloud or hybrid environments
  • Strong knowledge of cloud infrastructure and services (e.g., AWS, GCP, Kubernetes) and observability tools (e.g., Prometheus, Grafana, CloudWatch)
  • Strong programming experience in one of the following languages: Python, Java, or Scala
  • Proven ability to lead incident response, perform root cause analysis, and drive system reliability improvements
  • Able to lead across organizational boundaries and diverse reporting structures
  • Hands-on experience supporting enterprise data systems on distributed architectures
  • Expertise in cloud-native services, including ETL frameworks (Apache Spark, Flink) and messaging systems (Kafka)
  • Solid understanding of system design, data structures, and incident management best practices

Nice To Haves

  • Exposure to data visualization tools such as Tableau, Business Objects, and ThoughtSpot, with experience supporting and troubleshooting issues related to dashboards and reports
  • Experience with modern distributed databases such as Snowflake, Cassandra, SingleStore, and SAP HANA
  • Experience using GenAI or automation tools for issue detection, alerting, or remediation

Responsibilities

  • Provide technical leadership and guidance to the SRE team by applying hands-on skills and continuous learning.
  • Build and mentor a world-class engineering team that partners closely with platform teams to design scalable, reliable systems, while contributing actively to both platform and application code.
  • Manage Infrastructure as Code (IaC) and develop tooling to enhance engineering productivity.
  • Lead initiatives for cost optimization and operational efficiency at scale.
  • Actively participate in on-call rotations and resolve critical production issues.
  • Lead response efforts during major incidents and serve as the primary escalation point for complex problems.
  • Perform root cause investigations and ensure follow-up with actionable postmortems and infrastructure hardening initiatives.
  • Implement fixes—in code, infrastructure, or processes—to prevent recurrence.
  • Partner closely with engineering teams to troubleshoot issues, deploy fixes, and enhance system reliability.
  • Champion operational excellence through direct technical contributions.
  • Take ownership of application security, disaster recovery, and application documentation, keeping the documentation current with the latest system architecture and configurations.