Senior Engineering Manager - Next-generation Kubernetes platform

Nutanix•San Jose, CA

1d•$195,200 - $391,200•Hybrid

About The Position

We are looking for a Senior Engineering Manager to lead the design, development, and scaling of a next-generation Kubernetes platform powering enterprise environments. This platform will serve as the foundation for AI/ML workloads, GPU infrastructure, and enterprise applications, delivering hyperscaler-like capabilities in on-prem and hybrid deployments. You will lead a team responsible for building a production-grade, globally scalable Kubernetes platform, including cluster lifecycle, fleet management, multi-tenancy, and deep integration with compute (CPU/GPU), networking, and storage systems.

Requirements

Proven experience leading and scaling high-performing engineering teams
Ability to drive clarity, ownership, and execution in complex, ambiguous problem spaces
Strong understanding of distributed systems at scale
Hands-on familiarity with cloud platforms, infrastructure systems, or PaaS offerings
Experience building large, meaningful production systems (cloud platforms, infrastructure, or PaaS)
Platform & Systems Thinking: Experience designing multi-tenant platforms with clear abstractions (projects, quotas, policies)
Familiarity with multi-cluster / fleet management and large-scale system design
Ability to balance long-term architecture with near-term delivery
Track record of delivering reliable, production-grade systems
Experience with SLOs, observability, incident management, and lifecycle operations
Strong ability to work across product, hardware, and field teams
Effective executive-level communication and stakeholder management

Nice To Haves

Kubernetes experience is desirable but not required. We welcome leaders who are excited to learn Kubernetes deeply and apply strong systems fundamentals to this domain
Exposure to AI/ML workloads or GPU-based systems is a plus
We also welcome strong platform engineers who are excited to grow into AI infrastructure: this role offers the opportunity to learn and build in the rapidly evolving space of GPU scheduling, training, and inference systems

Responsibilities

Own end-to-end delivery of key platform capabilities, including cluster lifecycle, fleet management, and multi-tenancy
Drive the design of large-scale distributed systems, evolving toward global control planes and cell-based architectures
Lead a team of engineers to build AI-native infrastructure, including GPU-aware scheduling, resource isolation, and workload orchestration
Partner closely with Product and cross-functional teams to translate enterprise and AI use cases into platform capabilities
Establish a strong operational excellence culture, including SLOs, reliability engineering, and production readiness
Simplify complex infrastructure into intuitive, consumable platform experiences for enterprise users