Senior Site Reliability Engineer

Apple • Seattle, WA

About The Position

The Apple Services Engineering Cloud Services SRE organization is looking for a strong, enthusiastic developer to join as a member of this group. This person will have a tremendous amount of individual responsibility and influence over the direction the core platform of many critical Apple internet services takes for years to come. You are someone with ideas and real passion for software delivered as a service to improve reuse, efficiency, and simplicity. This engineer’s work will affect hundreds of millions of users and be essential to the success of some of the most visible current and future Apple features. We are domain experts in fleet management, systems, and software engineering. We build automations, instrument reliability tools, and respond to alerts and incidents that may pose a risk to the reliability of the platform. The team’s focus is on infrastructure capabilities and processes, improving the reliability and efficiency of the systems at scale.

Requirements

Bachelor's or Master's in Computer Science, Computer Engineering, or equivalent experience.
5+ years of experience developing platform services
Experience with large-scale server provisioning and maintenance (OpenStack Ironic, Metal3, MAAS, xCat, Netbox, Tinkerbell)
Experience with development within the Kubernetes ecosystem, including the operator framework, controllers, and CRDs
Understanding of base internet infrastructure services, including DNS, DHCP, LDAP, server virtualization, and server monitoring, plus experience with critical, large-scale distributed systems that combine hardware, operating systems, and software
Understanding of SRE principles, including monitoring, alerting, error budgets, fault analysis, and other common reliability engineering concepts, with a keen eye for opportunities to eliminate toil through code and process improvements.

Nice To Haves

Hardware bootstrap and associated security (PXE, BIOS, TPM, secure boot, trusted computing)
Experience with hyperscale server provisioning and maintenance (OpenStack Ironic, Metal3, MAAS, xCat, Netbox, Tinkerbell)
Structured or unstructured storage and caching
Automating operations processes via services and tools
Configuration management and fleet orchestration via Puppet, Chef, Ansible, or others
Cloud Services (AWS S3/EC2/CloudFront or equivalent)
