Kubernetes Platform Engineer – Control Plane & AI Infrastructure (hybrid) - 2010276

Cisco

About The Position

Join our Platform Engineering Team of experienced Kubernetes engineers who design, build, and operate large-scale on-premises Kubernetes environments. Our mission is to deliver a highly reliable, scalable, and GPU-enabled platform to support AI/ML workloads, while applying intelligent automation (AIOps) to improve platform operations. As part of this team, you will directly manage the Kubernetes control plane, extend platform capabilities via controllers and operators, and implement automation to detect, predict, and self-heal operational issues. Candidates must have hands-on, on-prem control plane experience and be able to work on site, as needed, within a hybrid work model.

Requirements

5+ years of software engineering experience
3+ years operating Kubernetes in production with hands-on control plane experience
Experience managing etcd (backup, restore, recovery) and performing control plane upgrades
Strong Go programming skills
Experience building Kubernetes operators/controllers and developing CRDs/webhooks
Deep understanding of scheduler, API server, controller loops, and reconciliation
Experience debugging and troubleshooting large-scale distributed systems
Candidates without on-prem or self-managed Kubernetes control plane experience will not be considered.

Nice To Haves

Experience in bare-metal or on-prem infrastructure
Experience supporting GPU-enabled workloads in Kubernetes
Exposure to building internal developer platforms
Contributions to CNCF or Kubernetes open-source projects
Hands-on experience with AI/ML-assisted operational automation (AIOps)
Experience applying statistical or ML techniques to operational data for platform reliability

Responsibilities

Design, build, and operate self-managed Kubernetes clusters (OpenShift / Anthos)
Manage and maintain etcd (backup, restore, quorum management, defrag)
Perform control plane upgrades and lifecycle management
Tune API server, scheduler, and controller manager for performance and reliability
Debug node-level and control-plane issues across large clusters
Implement networking (CNI), storage (CSI), and ingress integrations
Implement and extend runbook automation frameworks to reduce operational toil
Integrate AI agents that monitor cluster telemetry, detect anomalies, and trigger automated workflows (e.g., Slack notifications, remediation scripts)
Apply statistical or ML-based models on operational data from Splunk, Prometheus, and Kubernetes to predict failures, capacity saturation, or workload misbehavior
Build self-healing controllers and automated remediation pipelines
Implement predictive capacity planning and intelligent alert suppression workflows
Build Kubernetes controllers and operators (Go + controller-runtime)
Develop CRDs and admission webhooks to extend platform functionality
Automate cluster lifecycle and multi-cluster operations
Implement policies for workload isolation, governance, and compliance
Enable GPU and high-performance infrastructure for AI/ML workloads
Optimize scheduler and resource allocation for memory- and compute-intensive workloads
Support orchestration of AI/ML pipelines

Benefits

U.S. employees are offered benefits, subject to Cisco’s plan eligibility rules, which include medical, dental and vision insurance, a 401(k) plan with a Cisco matching contribution, paid parental leave, short- and long-term disability coverage, and basic life insurance.
Employees may be eligible to receive grants of Cisco restricted stock units, which vest following continued employment with Cisco for defined periods of time.
10 paid holidays per full calendar year, plus 1 floating holiday for non-exempt employees
1 paid day off for the employee’s birthday, paid year-end holiday shutdown, and 4 paid days off for personal wellness determined by Cisco
Non-exempt employees receive 16 days of paid vacation time per full calendar year, accrued at a rate of 4.92 hours per pay period for full-time employees
Exempt employees participate in Cisco’s flexible vacation time off program, which has no defined limit on how much vacation time eligible employees may use (subject to availability and some business limitations)
80 hours of sick time off provided on hire date and each January 1st thereafter, and up to 80 hours of unused sick time carried forward from one calendar year to the next
Additional paid time away may be requested to deal with critical or emergency issues for family members
Optional 10 paid days per full calendar year to volunteer
For non-sales roles, employees are also eligible to earn annual bonuses subject to Cisco’s policies.