Senior Machine Learning Ops Engineer

Peraton • Ashburn, VA


About The Position

Peraton is seeking an experienced Senior Machine Learning Ops Engineer to support U.S. Customs and Border Protection (CBP) by ensuring the secure, reliable, and scalable operation of machine learning systems within CBP’s analytics and intelligence support programs. This role operationalizes AI solutions by building the platforms, pipelines, monitoring, and governance controls that move models from research into mission-ready production environments. The ideal candidate combines strong reliability engineering, AI/ML lifecycle expertise, security awareness, cost optimization discipline, and cross-functional collaboration skills. Support will be provided across multiple mission locations: Ashburn, VA; Sterling, VA; and Washington, D.C.

Requirements

Minimum of 8 years of experience with a BS/BA, or 12 years of experience with a high school diploma or equivalent in lieu of a degree.
5+ years of experience in MLOps, DevOps, Site Reliability Engineering (SRE), or platform engineering supporting production systems.
Experience designing and operating ML platforms or AI infrastructure in cloud environments (AWS, Azure, or GCP).
Strong experience with Kubernetes, Docker, and containerized workloads.
Proficiency in Python (and/or Java or Go).
Experience implementing CI/CD pipelines, monitoring frameworks, and secure deployment practices.
Knowledge of machine learning lifecycle management, model monitoring, and data pipeline operations.
Ability to obtain and maintain CBP Background Investigation (BI) suitability.
U.S. citizenship required.

Nice To Haves

Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent experience).
Experience with ML platforms such as MLflow, Kubeflow, SageMaker, Azure ML, or Vertex AI.
Familiarity with distributed training, GPU workloads, or LLMOps environments.
Experience supporting federal, regulated, or high-security environments.
Relevant cloud, Kubernetes, or DevOps certifications.

Responsibilities

Lead the architecture, design, and operation of scalable ML platforms supporting model training, batch processing, and real-time inference.
Establish and maintain enterprise MLOps frameworks and best practices for model lifecycle management, reproducibility, and deployment.
Design and manage CI/CD pipelines for machine learning code, data pipelines, and model artifacts.
Deploy and operate containerized workloads using Kubernetes and cloud-native infrastructure.
Implement automated workflows for model versioning, validation, retraining, and drift detection.
Develop monitoring solutions for system health, model performance, latency, reliability, and operational metrics.
Define and maintain SLOs/SLAs, and support incident response and operational resilience for production ML systems.
Collaborate with data scientists, data engineers, and platform teams to productionize machine learning models at scale.
Ensure compliance with federal security, governance, and Responsible AI practices, including IAM/RBAC, encryption, and audit logging.
Provide technical leadership and mentorship while developing reusable tools, documentation, and platform standards that improve operational efficiency.