Technical Service Delivery Manager

Mirantis

About The Position

The Service Delivery Manager (SDM) is a deeply technical, customer-facing leader responsible for ensuring world-class support delivery, operational alignment, and proactive technical guidance across Mirantis’ enterprise customer base. This hybrid role combines the technical depth of a Level 2 Support Engineer, the relationship skills of a Customer Success Manager or Account Executive, and the ownership mindset of a Technical Lead responsible for the customer’s full Mirantis platform stack. The SDM serves as the primary operational and technical owner for assigned accounts: leading escalations, guiding platform operations, and ensuring customers achieve maximum value from Mirantis technologies, including Mirantis Kubernetes Engine (MKE), Mirantis Container Runtime (MCR), k0rdent, Lens, and OpenStack.

Requirements

5+ years in technical support, service delivery, DevOps, SRE, or similar customer-facing infrastructure roles.
Hands-on experience with Kubernetes and OpenStack in production environments.
Understanding of Nova, Neutron, Cinder, Keystone, Glance, Ceph, iSCSI, NFS, VLAN/VXLAN, and SDN concepts.
Strong Linux administration and distributed systems troubleshooting skills.
Proven experience managing high-severity escalations and communicating with executives.
Willingness to travel up to 30% and participate in an on-call rotation.
OpenStack Architecture Literacy: Understanding of Nova, Neutron, Cinder, Glance, and Keystone, and of common failure patterns such as DHCP or metadata failures, RabbitMQ quorum issues, and Ceph latency impacting Nova.
Cloud Control Plane & HA Concepts: Awareness of Galera clustering, HA control plane behavior, how outages manifest for customers, and sound judgment on triage and escalation.
MOSK Awareness: Understanding Mirantis OpenStack for Kubernetes (MOSK) release cycles, containerized services, StackLight monitoring, and the differences between platform upgrades, host OS lifecycle, and kernel upgrades, including EOL risk.
Ceph & Storage Fundamentals: High-level understanding of Ceph as block storage, how storage latency affects VM performance, and what OSD flapping, health warnings, or degraded clusters indicate.
Neutron Networking Basics: Familiarity with provider versus tenant networks, floating versus internal IPs, and typical causes of east/west traffic failures, metadata issues, and DHCP problems.
Incident Management Excellence: Ability to lead outage calls, structure communication, distinguish root cause from contributing factors, and drive follow-through to full resolution.
Escalation Ownership: Skill in routing issues to the correct teams (OpenStack, Ceph/storage, networking, hardware, infrastructure) and maintaining accountability.
Customer & Executive Communication: Ability to clearly explain what the issue is, why it matters, the risks, and next steps, maintaining trust and confidence.
Lifecycle & Capacity Planning: Awareness of when customers are approaching software end-of-life, resource exhaustion, host overcommit, or aging hardware risk, and ability to recommend remediation.

Nice To Haves

Experience running Kubernetes on top of OpenStack.
Exposure to CI/CD, infrastructure-as-code, and observability tooling.
Awareness of GPU and AI/ML workloads and their infrastructure requirements.
Certifications such as CKA, CKAD, CKS, OpenStack certifications, Docker, Linux Foundation, or cloud provider certifications.
Kubernetes Literacy: Understanding that many customers run Kubernetes on top of OpenStack, with basic familiarity with CSI, CNI, nodes, and pods to understand cross-platform dependencies.
Performance Concepts: High-level awareness of CPU pinning, hugepages, and NUMA topology, sufficient to contextualize engineering recommendations.
Broader Infrastructure Context: Familiarity with Octavia (load balancers), Designate (DNS), high availability patterns, and multi-region or FedRAMP-style constraints for regulated customers.
Monitoring Awareness: Basic knowledge of StackLight dashboards and common alerts such as Ceph health issues, API failures, and RabbitMQ quorum problems to help interpret severity.

Responsibilities

Serve as the primary technical authority for customer environments across Kubernetes, OpenStack, Linux, networking, storage, and security.
Provide L2-level troubleshooting and technical guidance across compute, control plane, networking, and storage layers.
Diagnose complex failures across OpenStack and Kubernetes components.
Guide customers through upgrades, lifecycle management, capacity planning, and architecture best practices.
Ensure customer issues are resolved within defined SLAs with minimal business impact.
Maintain greater than 95% CSAT across assigned accounts.
Conduct proactive platform reviews and drive root cause elimination for recurring issues.
Lead P1/P0 escalations, war rooms, and cross-functional incident response.
Provide clear, timely updates to customers and internal stakeholders throughout the incident lifecycle.
Prevent repeat incidents through structured RCA and strategic improvements.
Conduct recurring platform health reviews and risk assessments.
Drive modernization initiatives and adoption of MKE, MCR, k0rdent, and related Mirantis technologies.
Partner closely with Customer Success Managers on retention, renewals, and expansion opportunities.

Benefits

Work with an established Silicon Valley leader in the cloud infrastructure industry.
Work with exceptionally passionate, talented, and engaging colleagues, helping Fortune 500 and Global 2000 customers implement next-generation cloud technologies.
Be a part of cutting-edge, open-source innovation.
Thrive in the high-energy environment of a young company where openness, collaboration, risk-taking, and continuous growth are valued.
Receive a competitive compensation package with a strong benefits plan.
