Site Reliability Engineer

Accenture Federal Services • Arlington, VA

About The Position

At Accenture Federal Services, our purpose is to make the US federal government stronger and safer, and to improve the lives of its citizens. We are a technology company within the global Accenture network, comprising over 13,000 employees dedicated to leveraging technology and innovation for clients across defense, national security, public safety, civilian, and military health organizations. Recognized by Glassdoor as a Top 100 Best Place to Work, we foster a supportive and collaborative environment that empowers growth through hands-on experience, certifications, and industry training. Join us to drive meaningful change and advance government missions.

As a Site Reliability Engineer, you will be instrumental in advancing operational AI adoption within a sophisticated Hub-and-Spoke architecture. Your core responsibilities will involve ensuring the reliability, scalability, and continuous monitoring of enterprise AI systems that underpin mission-critical applications and enterprise AI governance.

Requirements

Experience with OpenTelemetry, Prometheus, Grafana, Loki, and Tempo to enhance system observability and performance
Hands-on experience with SLO/SLA management, FinOps practices, and advanced monitoring techniques to proactively identify and resolve issues before they impact mission outcomes
Exposure to complex integration efforts, continuous delivery pipelines, and mission-focused operational environments
Experience with reliability engineering, incident response, and FinOps
Must be a U.S. citizen
An active TS/SCI clearance is required

Responsibilities

Ensure the reliability, scalability, and performance of enterprise AI systems within a modern Hub-and-Spoke architecture
Lead incident response efforts to minimize downtime and maintain service continuity
Implement and manage SLOs/SLAs, capacity planning, and performance optimization strategies
Operate and enhance observability platforms using OpenTelemetry, Prometheus, Grafana, Loki, and Tempo
Drive FinOps practices to optimize operational costs and resource utilization
Collaborate with cross-functional teams in AI, DevSecOps, data engineering, platform engineering, and cybersecurity
Integrate monitoring and continuous feedback mechanisms for mission applications and agentic AI systems
Support enterprise AI governance and scalable software delivery through robust operational workflows
Proactively identify and resolve reliability and performance issues in production environments
Own incident response, performance optimization, and capacity planning for production systems
Maintain robust observability operations and support scalable software delivery for agentic AI systems