Senior AI Ops Engineer

Peraton | Ashburn, VA
1d | $104,000 - $166,000

About The Position

U.S. Customs and Border Protection (CBP) is seeking an AI Operations (AIOps) Engineer to build and operate secure, scalable, and mission-ready AI/ML and LLM systems in production, in support of the CBP analytics and intelligence support program. This role ensures reliable deployment, monitoring, governance, and continuous improvement of enterprise AI platforms supporting mission-critical analytics and decision support. The ideal candidate brings strong reliability engineering, AI/ML operational expertise, security and compliance awareness, cost optimization discipline, and the ability to collaborate across technical and mission teams in a high-assurance federal environment. Support will be provided across multiple mission locations: Ashburn, VA; Sterling, VA; and Washington, D.C.

Requirements

  • Minimum of 8 years of experience with a BS/BA, or 6 years with an MS/MA. 12 years of experience with a high school diploma or equivalent may be considered in lieu of a degree.
  • 5+ years of experience in SRE, DevOps, Platform Engineering, or ML Engineering supporting production systems.
  • Hands-on experience with Kubernetes and containerization; cloud platforms (AWS, Azure, or GCP); and CI/CD and observability tooling.
  • Proficiency in Python (and/or Java/Go).
  • Working knowledge of MLOps and LLMOps practices.
  • Strong understanding of security, IAM/RBAC, encryption, and AI data governance.
  • Active Top Secret clearance.
  • Ability to obtain and maintain required CBP Background Investigation (BI) suitability.
  • U.S. Citizenship required.

Nice To Haves

  • Bachelor’s degree in Computer Science, Engineering, or a related field (preferred).
  • Experience with ML platforms (MLflow, Kubeflow, SageMaker, Azure ML, Vertex AI).
  • Familiarity with data quality and orchestration tools.
  • Experience with GPU orchestration and distributed processing.
  • Certifications (Cloud, CKA/CKAD, Security+).
  • Experience in regulated or federal environments.

Responsibilities

  AI Platform Engineering
  • Design and operate cloud and on-prem AI/ML platforms supporting model training, batch scoring, real-time inference, and RAG-based LLM applications.
  • Deploy containerized workloads to Kubernetes and manage high availability, autoscaling, and release strategies.
  • Integrate model serving frameworks and feature stores to support scalable production inference.
  CI/CD & Model Lifecycle
  • Build and maintain CI/CD pipelines for code, data, and models.
  • Implement model versioning, registries, promotion gates, and environment parity.
  • Develop reproducible training and deployment workflows using infrastructure-as-code and orchestration tools.
  Observability & Reliability
  • Implement monitoring for system health, model performance, and model quality (including drift and bias indicators).
  • Define and manage SLOs/SLAs and participate in incident response for AI services.
  • Develop playbooks to address outages, regressions, and quality degradation.
  LLMOps & Guardrails
  • Operate LLM applications and RAG pipelines.
  • Manage vector databases and evaluation frameworks.
  • Implement AI safety controls, including prompt validation, content filtering, PII protection, and performance evaluation.
  • Optimize inference efficiency and infrastructure utilization.
  Security, Compliance & Governance
  • Enforce authentication, authorization, encryption, and secrets management best practices.
  • Implement controls supporting PII protection, audit logging, and data governance.
  • Align AI operations with NIST AI RMF, ISO 27001, SOC 2, and DHS/CBP security policies.
  • Support Responsible AI practices including bias testing, explainability, and human oversight.
  Cost & Performance Optimization
  • Monitor and optimize compute (GPU/CPU), storage, and network utilization.
  • Implement autoscaling and cost-efficient infrastructure strategies.
  • Provide visibility into per-model costs and capacity planning.
  Collaboration & Enablement
  • Partner with data scientists and ML engineers to productionize models.
  • Develop reusable templates, documentation, and operational runbooks.
  • Translate mission and compliance requirements into technical platform capabilities.