About The Position

Join our Platform Engineering Team of experienced Kubernetes engineers who design, build, and operate large-scale on-premises Kubernetes environments. Our mission is to deliver a highly reliable, scalable, and GPU-enabled platform to support AI/ML workloads, while applying intelligent automation (AIOps) to improve platform operations. As part of this team, you will directly manage the Kubernetes control plane, extend platform capabilities via controllers and operators, and implement automation to detect, predict, and self-heal operational issues. Candidates must have hands-on, on-prem control plane experience and be able to work on site as needed within a hybrid work model.

Requirements

  • 5+ years of software engineering experience
  • 3+ years operating Kubernetes in production with hands-on control plane experience
  • Experience managing etcd (backup, restore, recovery) and performing control plane upgrades
  • Strong Go programming skills
  • Experience building Kubernetes operators/controllers and developing CRDs/webhooks
  • Deep understanding of scheduler, API server, controller loops, and reconciliation
  • Experience debugging and troubleshooting large-scale distributed systems
  • Candidates without on-prem or self-managed Kubernetes control plane experience will not be considered.

Nice To Haves

  • Experience in bare-metal or on-prem infrastructure
  • Experience supporting GPU-enabled workloads in Kubernetes
  • Exposure to building internal developer platforms
  • Contributions to CNCF or Kubernetes open-source projects
  • Hands-on experience with AI/ML-assisted operational automation (AIOps)
  • Experience applying statistical or ML techniques to operational data for platform reliability

Responsibilities

  • Design, build, and operate self-managed Kubernetes clusters (OpenShift / Anthos)
  • Manage and maintain etcd (backup, restore, quorum management, defrag)
  • Perform control plane upgrades and lifecycle management
  • Tune API server, scheduler, and controller manager for performance and reliability
  • Debug node-level and control-plane issues across large clusters
  • Implement networking (CNI), storage (CSI), and ingress integrations
  • Implement and extend runbook automation frameworks to reduce operational toil
  • Integrate AI agents that monitor cluster telemetry, detect anomalies, and trigger automated workflows (e.g., Slack notifications, remediation scripts)
  • Apply statistical or ML-based models to operational data from Splunk, Prometheus, and Kubernetes to predict failures, capacity saturation, or workload misbehavior
  • Build self-healing controllers and automated remediation pipelines
  • Implement predictive capacity planning and intelligent alert suppression workflows
  • Build Kubernetes controllers and operators (Go + controller-runtime)
  • Develop CRDs and admission webhooks to extend platform functionality
  • Automate cluster lifecycle and multi-cluster operations
  • Implement policies for workload isolation, governance, and compliance
  • Enable GPU and high-performance infrastructure for AI/ML workloads
  • Optimize scheduler and resource allocation for memory- and compute-intensive workloads
  • Support orchestration of AI/ML pipelines

Benefits

  • U.S. employees are offered benefits, subject to Cisco’s plan eligibility rules, which include medical, dental, and vision insurance, a 401(k) plan with a Cisco matching contribution, paid parental leave, short- and long-term disability coverage, and basic life insurance.
  • Employees may be eligible to receive grants of Cisco restricted stock units, which vest following continued employment with Cisco for defined periods of time.
  • 10 paid holidays per full calendar year, plus 1 floating holiday for non-exempt employees
  • 1 paid day off for the employee’s birthday, a paid year-end holiday shutdown, and 4 paid days off for personal wellness, as determined by Cisco
  • Non-exempt employees receive 16 days of paid vacation time per full calendar year, accrued at a rate of 4.92 hours per pay period for full-time employees
  • Exempt employees participate in Cisco’s flexible vacation time off program, which has no defined limit on how much vacation time eligible employees may use (subject to availability and some business limitations)
  • 80 hours of sick time off provided on hire date and each January 1st thereafter, and up to 80 hours of unused sick time carried forward from one calendar year to the next
  • Additional paid time away may be requested to deal with critical or emergency issues for family members
  • Optional 10 paid days per full calendar year to volunteer
  • For non-sales roles, employees are also eligible to earn annual bonuses subject to Cisco’s policies.

What This Job Offers

  • Job Type: Full-time
  • Career Level: Mid Level
  • Education Level: No Education Listed
  • Number of Employees: 5,001-10,000
