Product Manager - AI Inference & Model Serving

Mirantis•Austin, TX

49d•Remote

About The Position

Mirantis is seeking a commercially driven, deeply technical Product Manager to lead AI inference and model serving for k0rdent AI, its control plane for GPU infrastructure and distributed AI workloads. The role sits at the intersection of AI inference, cloud-native infrastructure, distributed systems, and performance engineering. The Product Manager will define how customers deploy, scale, and operate production inference services while maximizing performance from the underlying GPU, network, and storage infrastructure.

The role is responsible for product strategy and solution development for inference products across on-premises, cloud, and edge environments. The scope includes serverless inference, dedicated endpoints, workload placement, autoscaling, routing, lifecycle management, observability, and full-stack performance optimization. The goal is to define how customers run production model-serving workloads at scale while improving latency, throughput, utilization, reliability, cost, and operational control.

The ideal candidate has experience with high-performance infrastructure products, understands production systems under real-world load, and is comfortable reasoning across the full stack: identifying performance bottlenecks, evaluating system design trade-offs, and translating technical insights into clear product requirements, architecture direction, and customer-facing solutions.

Requirements

7+ years in product management, technical product management, or a senior technical role owning AI/ML and inference product(s)
Strong understanding of production AI inference, including model serving, serverless execution, dedicated endpoints, autoscaling, routing, workload placement, observability, and reliability
Proven ability to reason about performance trade-offs across GPU, network, storage, orchestration, and runtime layers, and to translate low-level technical capability into business-relevant metrics such as time to first token (TTFT), throughput per GPU, and total cost of ownership (TCO)
Working knowledge of modern inference runtimes (vLLM, SGLang, TensorRT-LLM, Dynamo, Triton) and the optimization patterns that matter in production: continuous batching, KV cache management, cold starts, prefill versus decode, disaggregated serving, and multi-model serving
Credibility with engineering leaders and infrastructure operators, including comfort in production architecture reviews and technical commercial conversations with platform engineering buyers

Responsibilities

Own product strategy, roadmap, and lifecycle for inference and model serving, including serverless inference, dedicated endpoints, autoscaling, routing, KV cache management, and the related observability
Lead deep technical discovery with NeoClouds, sovereign clouds, and enterprise platform teams, and translate findings into prioritized requirements and architecture direction
Partner with engineering on system design trade-offs across runtime integration, GPU scheduling, network, storage, and serving topology, including disaggregated serving and multi-model serving
Define positioning grounded in measurable outcomes: latency distributions, throughput per GPU, utilization, tail reliability, and cost per token
Drive go-to-market execution: pricing and packaging, reference architectures, sizing guides, PoC playbooks, and direct engagement with customers, analysts, and ecosystem partners