TO-695 Senior AI Operations Engineer

Diverse Agile Solutions
Washington, DC
Hybrid

About The Position

Diverse Agile Solutions (DAS) is seeking a Senior AI Operations (AIOps) Engineer to lead the deployment, automation, monitoring, and governance of enterprise Artificial Intelligence and Machine Learning platforms supporting mission-critical federal systems. This position is ideal for someone who brings together DevOps, MLOps, Cloud Engineering, Site Reliability Engineering (SRE), and AI platform operations experience to run scalable, secure production environments. The Senior AI Operations Engineer will design, implement, automate, and support enterprise AI infrastructure and operational workflows. Responsibilities include deploying and maintaining production AI services, optimizing model performance, managing infrastructure automation, implementing monitoring solutions, and ensuring compliance with federal security requirements. The engineer will work closely with Data Scientists, Machine Learning Engineers, DevSecOps teams, Cloud Architects, Cybersecurity Engineers, and Software Developers to operationalize AI solutions across secure cloud environments.

Requirements

  • Bachelor's degree in Computer Science, Engineering, Information Systems, or related field
  • 8+ years of IT engineering experience
  • 5+ years supporting cloud infrastructure
  • 4+ years supporting AI/ML production environments
  • Experience deploying enterprise AI solutions
  • Strong knowledge of MLOps methodologies
  • Experience with CI/CD automation
  • Experience managing production Kubernetes clusters
  • Experience supporting containerized workloads
  • Experience with infrastructure automation
  • Strong Linux administration experience
  • Experience with scripting and automation
  • Excellent troubleshooting and analytical skills
  • Experience working in Agile environments
  • Strong communication and documentation skills
  • Cloud Platforms: AWS, Azure, Google Cloud Platform (GCP)
  • AI & Machine Learning: MLOps, Model deployment, Model monitoring, Model versioning, Model registry, Feature stores, Prompt management, Generative AI operations, AI inference optimization
  • DevOps & Automation: GitLab CI/CD, GitHub Actions, Jenkins, Terraform, Ansible, Helm, Docker, Kubernetes, OpenShift
  • Programming: Python, Bash, PowerShell, SQL, REST APIs
  • AI Frameworks: TensorFlow, PyTorch, Hugging Face Transformers, LangChain, MLflow, Kubeflow
  • Monitoring & Observability: Prometheus, Grafana, ELK Stack, Splunk, Datadog, CloudWatch, Azure Monitor
  • Data Technologies: PostgreSQL, MongoDB, Redis, Kafka, Snowflake, Vector Databases
  • Security: IAM, Secrets Management, Encryption, NIST 800-53, FedRAMP, Zero Trust Architecture

Nice To Haves

  • Experience supporting Federal Government customers
  • Experience operating AI workloads in AWS GovCloud
  • Experience with Azure AI Foundry
  • Experience with Azure OpenAI
  • Experience with Amazon Bedrock
  • Experience with Vertex AI
  • Experience implementing Responsible AI governance
  • Experience supporting Retrieval Augmented Generation (RAG) systems
  • Experience deploying LLM applications
  • Experience with GPU clusters
  • Experience with NVIDIA AI Enterprise
  • Experience with ServiceNow integrations
  • AWS Certified DevOps Engineer
  • AWS Certified Machine Learning Engineer
  • Microsoft Azure AI Engineer Associate
  • Microsoft Azure Administrator
  • Certified Kubernetes Administrator (CKA)
  • HashiCorp Terraform Associate
  • Certified Kubernetes Security Specialist (CKS)
  • Google Professional Machine Learning Engineer
  • Security+
  • CISSP

Responsibilities

  • Deploy, operate, and support enterprise AI/ML production environments
  • Design scalable MLOps pipelines for continuous model deployment
  • Automate AI infrastructure using Infrastructure as Code (IaC)
  • Build CI/CD pipelines supporting machine learning workflows
  • Implement automated model validation and deployment strategies
  • Monitor model health, drift, performance, and availability
  • Optimize GPU and compute resource utilization
  • Configure logging, observability, and operational dashboards
  • Manage AI model lifecycle from development through production
  • Support containerized AI workloads using Kubernetes
  • Build automated rollback and disaster recovery capabilities
  • Secure AI infrastructure following Zero Trust principles
  • Implement AI governance and model version management
  • Integrate AI platforms with enterprise applications
  • Maintain operational documentation and runbooks
  • Participate in incident response and root cause analysis
  • Collaborate with DevSecOps teams to automate security controls
  • Optimize cloud costs for AI workloads
  • Ensure compliance with NIST, FedRAMP, and other federal security standards

Benefits

  • Competitive salary
  • Comprehensive benefits package
  • 401(k)
  • Paid Time Off (PTO)
  • Paid Federal Holidays
  • Professional development and certification reimbursement
  • Career advancement opportunities
  • Collaborative, innovation-driven culture