Senior AI Ops Engineer

Peraton • Ashburn, VA

About The Position

U.S. Customs and Border Protection (CBP) is seeking an AI Operations (AIOps) Engineer to build and operate secure, scalable, and mission-ready AI/ML and LLM systems in production, in support of the CBP analytics and intelligence support program. This role ensures reliable deployment, monitoring, governance, and continuous improvement of enterprise AI platforms supporting mission-critical analytics and decision support. The ideal candidate brings strong reliability engineering, AI/ML operational expertise, security and compliance awareness, cost optimization discipline, and the ability to collaborate across technical and mission teams in a high-assurance federal environment. Support will be provided across multiple mission locations: Ashburn, VA; Sterling, VA; and Washington, D.C.

Requirements

Minimum of 8 years of experience with a BS/BA, or 6 years with an MS/MA. 12 years of experience with a high school diploma or equivalent can be considered in lieu of a degree.
5+ years of experience in SRE, DevOps, Platform Engineering, or ML Engineering supporting production systems.
Hands-on experience with Kubernetes and containerization, cloud platforms (AWS, Azure, or GCP), and CI/CD and observability tooling.
Proficiency in Python (and/or Java/Go).
Working knowledge of MLOps and LLMOps practices.
Strong understanding of security, IAM/RBAC, encryption, and AI data governance.
Active Top Secret clearance.
Ability to obtain and maintain required CBP BI suitability.
U.S. Citizenship required.

Nice To Haves

Bachelor’s degree in Computer Science, Engineering, or a related field (preferred).
Experience with ML platforms (MLflow, Kubeflow, SageMaker, Azure ML, Vertex AI).
Familiarity with data quality and orchestration tools.
Experience with GPU orchestration and distributed processing.
Certifications (Cloud, CKA/CKAD, Security+).
Experience in regulated or federal environments.

Responsibilities

Design and operate cloud and on-prem AI/ML platforms supporting model training, batch scoring, real-time inference, and RAG-based LLM applications.
Deploy containerized workloads to Kubernetes and manage high availability, autoscaling, and release strategies.
Integrate model serving frameworks and feature stores to support scalable production inference.
Build and maintain CI/CD pipelines for code, data, and models.
Implement model versioning, registries, promotion gates, and environment parity.
Develop reproducible training and deployment workflows using infrastructure-as-code and orchestration tools.
Implement monitoring for system health, model performance, and model quality (including drift and bias indicators).
Define and manage SLOs/SLAs and participate in incident response for AI services.
Develop playbooks to address outages, regressions, and quality degradation.
Operate LLM applications and RAG pipelines.
Manage vector databases and evaluation frameworks.
Implement AI safety controls, including prompt validation, content filtering, PII protection, and performance evaluation.
Optimize inference efficiency and infrastructure utilization.
Enforce authentication, authorization, encryption, and secrets management best practices.
Implement controls supporting PII protection, audit logging, and data governance.
Align AI operations with NIST AI RMF, ISO 27001, SOC 2, and DHS/CBP security policies.
Support Responsible AI practices including bias testing, explainability, and human oversight.
Monitor and optimize compute (GPU/CPU), storage, and network utilization.
Implement autoscaling and cost-efficient infrastructure strategies.
Provide visibility into per-model costs and capacity planning.
Partner with data scientists and ML engineers to productionize models.
Develop reusable templates, documentation, and operational runbooks.
Translate mission and compliance requirements into technical platform capabilities.