The Role
We’re seeking a strategic technologist and Kubernetes/container platform engineer with 15+ years’ overall experience to lead and scale our Kubernetes container platform and middleware stack. You will architect, deploy, operate, and evolve high-availability Kubernetes infrastructure (EKS and OpenShift), ensuring seamless middleware operations (Kafka, Redis Enterprise Cluster, 3scale API Gateway). You will automate container deployment (ArgoCD), enforce container security and network policies, oversee capacity planning and Helm chart development, and define policy-as-code governance across the environment. Your mission is to deliver a hardened, future-ready platform that enables multiple engineering teams to develop, deploy, and scale cloud-native applications reliably and securely.
We’ll trust you to:
Design and implement infrastructure abstractions and APIs that simplify deploying AI workloads using Kubernetes-native operations and GitOps patterns.
Architect, deploy, and manage Kubernetes platforms (AWS EKS and Red Hat OpenShift) across different environments.
Implement GitOps workflows with ArgoCD to manage declarative app deployments.
Design and operate middleware infrastructure:
  Highly available Kafka clusters (mirroring, partitioning, tooling)
  Managed Redis Enterprise clusters (sharding, high availability, replication)
  3scale API Gateway development and administration
Build and manage Helm charts, including templating, parameterization, and versioning, for both platform and middleware stacks.
Enforce container security and policy governance using policy-as-code tools (e.g. OPA, Kyverno), image scanning (e.g. Clair, Snyk), and automated admission controls.
Implement network policies (Kubernetes NetworkPolicy / Calico) to enforce segmentation and micro-segmentation.
Configure and manage a service mesh (e.g. Istio, Linkerd) for observability, traffic control, and secure service-to-service communication.
Conduct capacity planning, cluster sizing, and resource tuning, and define autoscaling strategies.
Conduct architecture reviews, train engineers, and drive platform best practices across teams.
Partner with SREs to define platform SLAs, uptime targets, resilience benchmarks, and alerting/monitoring.
Lead incident response and root cause analysis, automating recovery workflows and improving platform resiliency.
Job Type
Full-time
Career Level
Senior
Education Level
No Education Listed
Number of Employees
501-1,000 employees