Site Reliability Engineer — Info Apps

Apple • Cupertino, CA

About The Position

In this role, you will be a key pillar of our engineering organization, ensuring that our services remain highly available and performant. Your impact will include:

System Architecture: Designing and implementing the next generation of our telemetry and alerting systems.
Reliability Engineering: Defining SLOs/SLIs and ensuring our monitoring strategy captures the true health of the user experience.
Operational Excellence: Reducing operational load through software; if you have to do it twice, you’ll want to automate it.
Collaboration: Partnering with App Dev teams to influence the "design for reliability" phase of the software development lifecycle.
Mentorship: Acting as a technical lead for junior members and off-shore partners, providing guidance on runbook development and disaster recovery.

Requirements

5+ years in SRE, DevOps, or Infrastructure roles with a proven track record of managing high-traffic, internet-facing production environments.
Deep experience building and operating container orchestration systems (EKS, GKE, or vanilla Kubernetes). You should be comfortable troubleshooting from the networking layer up to the application pod.
Expert knowledge of the four Golden Signals (latency, traffic, errors, and saturation). Proficiency with tools like Prometheus, Grafana, and Splunk is essential.
Hands-on experience designing and maintaining resilient infrastructure on public cloud providers (AWS, GCP, or Azure).
Strong scripting skills (Python or Go preferred) for automating toil and building self-healing systems.
Experience leading incident response, performing Root Cause Analysis (RCA), and running blameless postmortems to improve system resilience.
Proficiency in Terraform, CloudFormation, or Pulumi for managing immutable infrastructure.

Nice To Haves

Specialized experience operating and tuning Solr or Elasticsearch at scale.
Strong understanding of TCP/IP, load balancing (ELB/ALB), and service meshes (Istio/Linkerd).
Experience with Kafka, Cassandra, or Postgres in a distributed environment.
