Kubernetes / OpenShift AI Platform Engineer

TEKsystems • Chandler, AZ

5d • $70 - $89 • Hybrid

About The Position

We are seeking a Kubernetes / OpenShift AI Platform Engineer to design, build, and optimize enterprise-scale infrastructure supporting advanced AI/ML workloads. This role sits at the intersection of platform engineering, DevOps, and AI infrastructure, enabling model development, training, and real-time inference in a highly regulated environment. You will work cross-functionally with AI/ML engineers, data scientists, DevOps, and infrastructure teams to deliver scalable, secure, and high-performance AI platforms.

Requirements

5–7+ years of experience with Kubernetes in production environments
Strong experience with Red Hat OpenShift in enterprise environments
5–7+ years of hands-on experience with Docker and containerization technologies
Strong proficiency in Python for automation and platform engineering
Solid experience working in Linux environments (systems, networking, storage)
Experience with AWS or other cloud platforms
Hands-on experience with Terraform and CI/CD tools (e.g., Jenkins)
Experience supporting AI/ML platforms, model deployment pipelines, or similar workloads
Deep understanding of Kubernetes architecture and cluster lifecycle management
Proven ability to operate in large-scale, fast-paced enterprise environments
Strong problem-solving and troubleshooting skills across distributed systems
Experience building platforms that support other engineering teams

Nice To Haves

Experience with AI/ML frameworks and serving tools such as PyTorch, TensorFlow, Triton Inference Server, and vLLM
Experience with agentic AI systems or intelligent agents
Familiarity with Kubernetes Operators and Helm
Familiarity with GitOps practices and platform standardization
Strong understanding of observability tooling (Prometheus, Grafana)
Strong understanding of Kubernetes/OpenShift security models (SCCs, RBAC, etc.)

Responsibilities

Design and manage Kubernetes and OpenShift clusters at enterprise scale
Build and optimize infrastructure for AI/ML model training and inference workloads
Develop automation for deployment, configuration, patching, and platform operations using Python
Support GPU-enabled workloads and high-performance compute environments
Implement and maintain CI/CD pipelines, GitOps workflows, and infrastructure-as-code (Terraform)
Ensure platform reliability, scalability, and performance optimization
Implement security best practices including RBAC, network policies, and secrets management
Enable observability through Prometheus, Grafana, and logging frameworks
Collaborate with engineering teams to standardize and streamline AI platform environments