Dev Ops System Administrator (TS/SCI with Polygraph) #ESF1365

ExpertHiring•Scottsdale, AZ

Onsite

About The Position

The Dev Ops System Administrator designs, implements, and maintains scalable, robust infrastructure for AI/ML model training and inference, and develops and manages CI/CD pipelines for automated building, testing, and deployment of AI applications and machine learning models. Responsibilities include administering and optimizing Linux-based systems and virtualized environments; managing containerization and orchestration platforms (e.g., Docker, Kubernetes) to deploy and scale ML services; and automating infrastructure provisioning, configuration management, and deployment processes using Infrastructure as Code (IaC) tools such as Ansible or Terraform. The role also covers allocating GPU resources efficiently for model training and other high-performance computing tasks; implementing and maintaining monitoring, logging, and alerting systems; and collaborating with development teams to support their infrastructure needs and troubleshoot issues.

Requirements

Bachelor’s degree in Computer Science or a related field (or equivalent experience) plus a minimum of 8 years of relevant experience, or a Master’s degree plus 6 years of relevant experience.
Department of Defense TS/SCI with Polygraph security clearance is required at time of hire.
Advanced understanding of server-based operating systems.
Strong Linux, container, and AI skills.
Subject matter expert (SME) with the ability to mentor others on administering the server environment.
Advanced troubleshooting skills across the server OS, networking, and storage technologies.
Hands-on experience developing, deploying, and supporting large-scale enterprise server solutions.

Nice To Haves

Experience with, or familiarity with, AI/ML models is preferred.

Responsibilities

Design, implement, and maintain scalable and robust infrastructure for AI/ML model training and inference.
Develop and manage CI/CD pipelines for automated building, testing, and deployment of AI applications and machine learning models.
Administer and optimize Linux-based systems and virtualized environments.
Manage containerization and orchestration platforms (e.g., Docker, Kubernetes) to deploy and scale ML services.
Automate infrastructure provisioning, configuration management, and deployment processes using Infrastructure as Code (IaC) tools such as Ansible or Terraform.
Manage and allocate GPU resources efficiently for model training and other high-performance computing tasks.
Implement and maintain monitoring, logging, and alerting systems to ensure platform health and performance.
Collaborate with development teams to support their infrastructure needs and troubleshoot issues.