Senior Site Reliability Engineer

Accenture Federal Services, Arlington, VA

About The Position

The Site Reliability Engineer will ensure the reliability, performance, and scalability of the Client System. The engineer will define and track Key Performance Indicators and Service Level Objectives, identify and resolve performance bottlenecks, and perform root cause analysis on incidents to implement preventative measures and enhance system efficiency. This role will design and implement monitoring and alerting systems to provide visibility into system health and performance, develop and maintain runbooks and playbooks for operational procedures, and automate routine operational tasks to improve efficiency and reduce human error. The engineer will conduct capacity planning to ensure the system can handle expected loads, implement load testing to validate system performance under stress, and establish disaster recovery procedures to minimize downtime. They will participate in on-call rotations to respond to system incidents, collaborate with development teams to improve application reliability and performance, and implement chaos engineering practices to identify weaknesses before they impact users.

Requirements

  • Bachelor’s degree (or 4 additional years of experience in lieu of a degree)
  • 8 years of experience managing reliability and uptime and automating operations for large IT systems
  • Must meet DoD 8140 requirements
  • Active TS/SCI clearance

Responsibilities

  • Define and track Key Performance Indicators and Service Level Objectives
  • Identify and resolve performance bottlenecks
  • Perform root cause analysis on incidents to implement preventative measures and enhance system efficiency
  • Design and implement monitoring and alerting systems to provide visibility into system health and performance
  • Develop and maintain runbooks and playbooks for operational procedures
  • Automate routine operational tasks to improve efficiency and reduce human error
  • Conduct capacity planning to ensure the system can handle expected loads
  • Implement load testing to validate system performance under stress
  • Establish disaster recovery procedures to minimize downtime
  • Participate in on-call rotations to respond to system incidents
  • Collaborate with development teams to improve application reliability and performance
  • Implement chaos engineering practices to identify weaknesses before they impact users

Benefits

  • Hands-on experience
  • Certifications
  • Industry training