Mission users increasingly rely on agentic AI systems to support complex workflows, accelerate analysis, and improve decision advantage. Unlike traditional software systems, agentic AI platforms introduce operational complexity across model invocations, workflow orchestration, tool integrations, retrieval and knowledge layers, safety controls, and probabilistic outputs. As an AI Platform Site Reliability Engineer (SRE), you’ll help ensure the availability, resiliency, observability, and operational integrity of an AWS GovCloud-based agentic AI platform supporting national defense missions.
In this role, you’ll serve as the reliability owner for production AI operations. You’ll work cross-functionally with cloud engineering, platform engineering, AI agent development, MLOps, data science, and customer knowledge teams to operationalize their work in production through monitoring, alerting, Service Level Indicator (SLI) and Service Level Objective (SLO) management, incident response, ticket triage, change control, and automation. You won’t duplicate model development, data science, or cloud platform build responsibilities. Instead, you’ll ensure that the system, its agents, and their supporting services remain healthy, traceable, performant, and supportable in mission environments.
You’ll define and monitor operational health signals across agent workflows, model latency, session and task success, knowledge-base ingestion health, tool and API dependencies, guardrail or safety interventions, throttling, token usage, drift indicators, and service degradation patterns. You’ll help reduce operational toil by building dashboards, alarms, runbooks, and automated remediation workflows, while driving post-incident learning and continuous improvement.
Job Type
Full-time
Career Level
Senior
Number of Employees
5,001-10,000 employees