Site Reliability Engineer

TEKsystems•Chandler, AZ

4d•$60 - $63•Hybrid

About The Position

Cloud Site Reliability Engineer (SRE) for Internal Cloud. Maintain services once they are live by measuring and monitoring availability, latency and overall system health. Troubleshoot issues across the entire stack: hardware, software, application and network. Perform deep dives into both systemic and latent reliability issues; partner with engineering and operations teams across the organization to produce and roll out fixes. Drive standardization efforts across multiple disciplines and services in conjunction with embedded SREs throughout the organization. Identify and drive opportunities to improve automation for the cloud services. Scope and create automation for deployment, management and visibility of our services.

Requirements

Cloud
Unix
Linux
Terraform
Java
Python
Ansible
Shell

Nice To Haves

Experience at a large, highly regulated company.
Ideally financial services experience.

Responsibilities

Maintain services once they are live by measuring and monitoring availability, latency and overall system health.
Troubleshoot issues across the entire stack: hardware, software, application and network.
Perform deep dives into both systemic and latent reliability issues; partner with engineering and operations teams across the organization to produce and roll out fixes.
Drive standardization efforts across multiple disciplines and services in conjunction with embedded SREs throughout the organization.
Identify and drive opportunities to improve automation for the cloud services.
Scope and create automation for deployment, management and visibility of our services.