Senior Video SRE

Apple, San Diego, CA
5d

About The Position

As a Senior Video Site Reliability Engineer at Apple, you will be responsible for the reliability, scalability, and performance of our distributed applications that serve millions of users globally. You will build strong partnerships with application development teams, sister SRE teams, platform teams, product teams, as well as video encoding specialists to drive shared ownership of service reliability and maintain exceptional quality of experience for our customers.

Your day-to-day work will include embedding reliability practices into the development lifecycle, building sophisticated monitoring and observability solutions, and developing automation to reduce operational toil. You will own critical infrastructure components that video services depend on, and use data-driven approaches to identify and eliminate single points of failure. You will lead reliability design reviews and define SLO frameworks that set the standard for service health. As part of the role, you will participate in on-call rotations, lead incident response efforts, and drive post-incident reviews that result in meaningful reliability improvements.

This role offers the opportunity to work with complex JVM-based microservices and distributed systems technologies and influence architectural decisions that shape how Apple delivers video streaming content worldwide.

Requirements

  • Bachelor's degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience.
  • 5+ years of experience in Site Reliability Engineering, DevOps, or Systems Engineering with demonstrated senior-level impact.
  • Production ownership at scale, including on-call and incident response, post-incident reviews, and driving operational improvements.
  • Strong understanding of Linux fundamentals and networking principles, with experience operating and debugging production systems.
  • Proficiency in at least one programming language (Shell, Python, Go, or similar) to reduce toil, build SRE tooling, and improve operability.
  • Hands-on experience with cloud infrastructure and container orchestration.
  • Excellent troubleshooting and root-cause analysis skills across the full technology stack.
  • Effective communicator who can collaborate with cross-functional partners to drive reliability outcomes.

Nice To Haves

  • Thorough understanding of distributed systems fundamentals, failure modes, and resilience patterns that prevent cascading outages.
  • Track record of building and continuously improving observability (metrics/logs/traces), alert quality, and incident response processes for complex, high-traffic environments.
  • Hands-on performance optimization, capacity planning, and reliability engineering (load testing, bottleneck analysis, degradation strategies).
  • Proven ability to build and operate Infrastructure as Code and CI/CD pipelines, including safe deployment practices and change risk controls.
  • Experience debugging and operating JVM-based applications in production (e.g., understanding of GC, thread analysis, heap profiling).
  • Working knowledge of database systems, key-value stores, caching layers, message queues, and storage infrastructure at scale.
  • Familiarity with video streaming technologies, codecs, protocols, and media delivery infrastructure.

Responsibilities

  • Embedding reliability practices into the development lifecycle
  • Building sophisticated monitoring and observability solutions
  • Developing automation to reduce operational toil
  • Owning critical infrastructure components that video services depend on
  • Using data-driven approaches to identify and eliminate single points of failure
  • Leading reliability design reviews
  • Defining SLO frameworks that set the standard for service health
  • Participating in on-call rotations
  • Leading incident response efforts
  • Driving post-incident reviews that result in meaningful reliability improvements