Senior Site Reliability Engineer

Accenture Federal Services • Arlington, VA

About The Position

The Site Reliability Engineer will ensure the reliability, performance, and scalability of the Client System. The engineer will define and track Key Performance Indicators and Service Level Objectives, identify and resolve performance bottlenecks, and perform root cause analysis on incidents to implement preventative measures and enhance system efficiency. This role will design and implement monitoring and alerting systems to provide visibility into system health and performance, develop and maintain runbooks and playbooks for operational procedures, and automate routine operational tasks to improve efficiency and reduce human error. The engineer will conduct capacity planning to ensure the system can handle expected loads, implement load testing to validate system performance under stress, and establish disaster recovery procedures to minimize downtime. They will participate in on-call rotations to respond to system incidents, collaborate with development teams to improve application reliability and performance, and implement chaos engineering practices to identify weaknesses before they impact users.

Requirements

Bachelor’s degree (or 4 additional years of experience in place of a degree)
8 years of experience managing reliability and uptime and automating operations for large IT systems
Must meet DoD 8140 requirements
Active TS/SCI clearance

Responsibilities

Define and track Key Performance Indicators and Service Level Objectives
Identify and resolve performance bottlenecks
Perform root cause analysis on incidents to implement preventative measures and enhance system efficiency
Design and implement monitoring and alerting systems to provide visibility into system health and performance
Develop and maintain runbooks and playbooks for operational procedures
Automate routine operational tasks to improve efficiency and reduce human error
Conduct capacity planning to ensure the system can handle expected loads
Implement load testing to validate system performance under stress
Establish disaster recovery procedures to minimize downtime
Participate in on-call rotations to respond to system incidents
Collaborate with development teams to improve application reliability and performance
Implement chaos engineering practices to identify weaknesses before they impact users