AI Platform Engineer

CACI International, Huntsville, AL
Onsite

About The Position

The F3I-3 program at CACI is seeking an AI Platform Engineer to join their team and help solve one of their customer’s toughest problems. This role involves designing, architecting, and leading the development of AI/ML platform components and agentic AI applications.

The engineer will implement and optimize AI/ML algorithms, write production-quality code primarily in Python (potentially Go, C, C++, or Rust), and deploy container-based applications using CI/CD best practices and MLOps infrastructure on platforms like Red Hat OpenShift and public clouds (AWS, GCP, Azure). The position requires staying current with Generative AI advancements, performing root cause analysis, and solving complex problems in a dynamic environment.

A key responsibility is architecting and maintaining data pipelines for training data, model artifacts, and inference logs within a governed data lake. The role also includes designing, implementing, and operating a unified MLOps platform for both on-premises and cloud-hosted Kubernetes clusters, enabling rapid onboarding of new Agentic AI services and ensuring consistent governance.

Collaboration with research scientists, data scientists, product teams, and stakeholders is essential to translate prototypes into production-grade services, ensuring reproducibility, security, and compliance. Mentoring junior engineers and contributing to knowledge bases are also part of the role. Performance optimization for inference workloads (GPU/CPU scaling, model quantization, batching strategies) and championing best practices in security, cost efficiency, and disaster recovery for hybrid infrastructure are critical.

Requirements

  • Current TS/SCI clearance with Polygraph
  • Bachelor's degree in Computer Science, Information Systems, Cybersecurity, or related field; or equivalent experience in systems engineering
  • 7+ years of experience as a Platform Engineer, Systems Engineer, DevSecOps Engineer, or Infrastructure Engineer supporting classified DoW or Intelligence Community operations

Nice To Haves

  • Experience in Platform Development: Design, architect, document, and lead the development of AI/ML platform components and agentic AI applications
  • Experience in Coding & Implementation: Implement and optimize cutting-edge AI/ML algorithms and production-quality code, primarily using Python and possibly Go, C, C++, or Rust
  • Experience in Deployment & Operations: Build and deploy container-based applications to platforms like Red Hat OpenShift and public clouds (AWS, GCP, Azure), leveraging CI/CD best practices
  • Ability to work effectively in distributed, mission-focused teams and adapt to rapidly changing operational priorities

Responsibilities

  • Design, architect, document, and lead the development of AI/ML platform components and agentic AI applications.
  • Implement and optimize cutting-edge AI/ML algorithms and production-quality code, primarily using Python and possibly Go, C, C++, or Rust.
  • Build and deploy container-based applications to platforms like Red Hat OpenShift and public clouds (AWS, GCP, Azure), leveraging CI/CD best practices and MLOps infrastructure.
  • Stay current with advancements in Generative AI and related technologies, conduct root cause analysis, and solve complex problems in a dynamic environment.
  • Architect and maintain data pipelines that feed training data, model artifacts, and inference logs into a governed data lake (S3, on-prem object store).
  • Design, implement, and operate a unified MLOps platform that supports Kubernetes clusters hosted both on-premises and in commercial clouds.
  • Enable rapid onboarding of new Agentic AI services and provide consistent governance across environments.
  • Work closely with research scientists, data scientists, product teams, and stakeholders to translate Agentic AI prototypes into production-grade services, ensuring reproducibility, security, and compliance.
  • Mentor junior engineers and contribute to internal knowledge bases, upskilling, and review processes.
  • Drive performance optimization for inference workloads (GPU/CPU scaling, model quantization, batching strategies).
  • Champion best practices in security (IAM, network policies, secret management), cost efficiency, and disaster recovery for the hybrid infrastructure.

Benefits

  • Flexible time off
  • Robust learning resources
  • Comprehensive benefits, including healthcare, wellness, financial, retirement, family support, continuing education, and time off