Site Reliability Engineer

General Dynamics Mission Systems, Inc.

11d • $142,696 - $158,303 • Remote

About The Position

This role applies Site Reliability Engineering (SRE) principles to AI services and offers the opportunity to build an SRE practice from scratch. The engineer will define and implement SLOs, monitoring, incident response, operational readiness reviews, capacity planning, and toil elimination for AI services. Unlike traditional SRE roles, this position also addresses AI-specific failure modes such as model drift and token budget exhaustion. The engineer will have direct authority over whether AI services go live, based on the outcome of operational readiness reviews. The role does not involve application development, AI model building, or infrastructure provisioning; its focus is ensuring that these components are operable and reliable.

Requirements

Bachelor’s degree in Computer Science, Software Engineering, or a related field, plus 5 years of experience; or Master’s degree plus 3 years of experience.
Production SRE or DevOps experience, with a proven track record of owning system reliability.
Hands-on experience with monitoring and observability tools such as Prometheus, Grafana, Datadog, ELK, CloudWatch, or similar.
Strong scripting and automation skills in Python or Bash, and experience with infrastructure-as-code tools like Terraform or CloudFormation.
Experience with containerized environments including Docker and Kubernetes.
Experience defining and managing SLOs, error budgets, and incident response procedures in production.
U.S. citizenship required.
Department of Defense Secret security clearance required at time of hire.

Nice To Haves

Experience with AI/ML production systems, including model serving, inference monitoring, and token cost tracking.
Multi-cloud experience (AWS, Azure, GCP) with knowledge of cloud-native monitoring and logging services.
Experience building operational readiness review processes or production launch checklists.
Familiarity with Google SRE principles.
Experience in environments where reliability has compliance or safety implications (defense, healthcare, finance, or critical infrastructure).

Responsibilities

Define service level objectives (SLOs) for AI services and establish error budgets to drive engineering decisions.
Build and maintain monitoring, logging, and alerting infrastructure for AI services.
Establish incident management procedures, lead post-incident reviews, and drive corrective actions.
Conduct operational readiness reviews to ensure AI services meet reliability, security, and operational standards before going live.
Track resource consumption, forecast capacity needs, and monitor costs (tokens, compute, storage) for AI services.
Identify and automate repetitive operational tasks to eliminate toil.