Director of Engineering, Inference Services

CoreWeave | Sunnyvale, CA
$206,000 - $303,000 | Hybrid

About The Position

CoreWeave is The Essential Cloud for AI™. Built for pioneers by pioneers, CoreWeave delivers a platform of technology, tools, and teams that enables innovators to build and scale AI with confidence. Trusted by leading AI labs, startups, and global enterprises, CoreWeave combines superior infrastructure performance with deep technical expertise to accelerate breakthroughs and turn compute into capability. Founded in 2017, CoreWeave became a publicly traded company (Nasdaq: CRWV) in March 2025. Learn more at www.coreweave.com.

About This Role

CoreWeave is looking for a Director of Engineering to own and scale our next-generation Inference Platform. In this highly technical, strategic role, you will lead a world-class engineering organization to design, build, and operate the fastest, most cost-efficient, and most reliable GPU inference services in the industry. Your charter spans everything from model-serving runtimes (e.g., Triton, vLLM, TensorRT-LLM) and autoscaling micro-batch schedulers to developer-friendly SDKs and airtight, multi-tenant security, all delivered on CoreWeave's unique accelerated-compute infrastructure.
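For a flavor of the micro-batching problem at the heart of this charter, here is a minimal, purely illustrative Python sketch (not CoreWeave code): an asyncio scheduler that coalesces incoming requests into a batch bounded by a maximum size and a maximum wait, trading a few milliseconds of queueing latency for much higher GPU utilization. All names and constants below are hypothetical.

    import asyncio
    import time

    # Hypothetical knobs: batch no more than MAX_BATCH requests, and never
    # hold the first request longer than MAX_WAIT_MS before dispatching.
    MAX_BATCH = 8
    MAX_WAIT_MS = 5.0

    async def fake_model(batch):
        # Stand-in for one GPU forward pass over the whole batch.
        await asyncio.sleep(0.002)
        return [f"echo:{prompt}" for prompt in batch]

    class MicroBatcher:
        def __init__(self):
            self.queue = asyncio.Queue()

        async def submit(self, prompt):
            # Callers await a per-request future; the batch loop resolves it.
            fut = asyncio.get_running_loop().create_future()
            await self.queue.put((prompt, fut))
            return await fut

        async def run(self):
            while True:
                # Block until the first request arrives, then start the clock.
                prompt, fut = await self.queue.get()
                prompts, futs = [prompt], [fut]
                deadline = time.monotonic() + MAX_WAIT_MS / 1000
                while len(prompts) < MAX_BATCH:
                    timeout = deadline - time.monotonic()
                    if timeout <= 0:
                        break
                    try:
                        prompt, fut = await asyncio.wait_for(self.queue.get(), timeout)
                    except asyncio.TimeoutError:
                        break
                    prompts.append(prompt)
                    futs.append(fut)
                # One batched model call amortizes kernel-launch and memory cost.
                for f, result in zip(futs, await fake_model(prompts)):
                    f.set_result(result)

    async def main():
        batcher = MicroBatcher()
        runner = asyncio.create_task(batcher.run())
        replies = await asyncio.gather(*(batcher.submit(f"p{i}") for i in range(20)))
        print(len(replies), replies[:3])
        runner.cancel()

    asyncio.run(main())

The tension visible in MAX_WAIT_MS is exactly the latency-versus-throughput trade-off that the P99 targets in the requirements below constrain.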

Requirements

  • 10+ years building large-scale distributed systems or cloud services, with 5+ years leading multiple engineering teams.
  • Proven success delivering mission-critical model-serving or real-time data-plane services (e.g., Triton, TorchServe, vLLM, Ray Serve, SageMaker Inference, GCP Vertex Prediction).
  • Deep understanding of GPU/CPU resource isolation, NUMA-aware scheduling, micro-batching, and low-latency networking (gRPC, QUIC, RDMA).
  • Track record of optimizing cost-per-token / cost-per-request and hitting sub-100 ms global P99 latencies (a back-of-the-envelope version of this metric is sketched after this list).
  • Expertise in Kubernetes, service meshes, and CI/CD for ML workloads; familiarity with Slurm, Kueue, or other schedulers a plus.
  • Hands-on experience with LLM optimization (quantization, compilation, tensor parallelism, speculative decoding) and hardware-aware model compression.
  • Excellent communicator who can translate deep technical concepts into clear business value for C-suite and engineering audiences.
  • Bachelor’s or Master’s in CS, EE, or related field (or equivalent practical experience).
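As a concrete anchor for the cost-per-token requirement above, here is a back-of-the-envelope calculation in Python. Every number is an assumption for illustration, not a CoreWeave price or benchmark; the point is the shape of the math.

    # Back-of-the-envelope cost-per-token math; every number is assumed.
    gpu_hourly_price_usd = 4.00   # assumed on-demand price for one GPU
    tokens_per_second = 2500      # assumed aggregate decode throughput
    gpu_utilization = 0.60        # fraction of time spent on useful work

    effective_tps = tokens_per_second * gpu_utilization
    cost_per_token = gpu_hourly_price_usd / 3600 / effective_tps
    print(f"${cost_per_token * 1e6:.2f} per million tokens")  # -> $0.74

Utilization sits in the denominator, which is why batching efficiency and autoscaling headroom move the unit economics as much as raw hardware price does.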

Nice To Haves

  • Experience operating multi-region inference fleets at a cloud provider or hyperscaler.
  • Contributions to open-source inference or MLOps projects.
  • Familiarity with observability stacks (Prometheus, Grafana, OpenTelemetry) for AI workloads.
  • Background in edge inference, streaming inference, or real-time personalization systems.

Responsibilities

  • Vision & Roadmap - Define and continuously refine the end-to-end Inference Platform roadmap, prioritizing low-latency, high-throughput model serving and world-class developer UX. Set technical standards for runtime selection, GPU/CPU heterogeneity, quantization, and model-optimization techniques.
  • Platform Architecture - Design and implement a global, Kubernetes-native inference control plane that delivers <50 ms P99 latencies at scale. Build adaptive micro-batching, request-routing, and autoscaling mechanisms that maximize GPU utilization while meeting strict SLAs. Integrate model-optimization pipelines (TensorRT, ONNX Runtime, BetterTransformer, AWQ, etc.) for frictionless deployment. Implement state-of-the-art runtime optimizations, including speculative decoding, KV-cache reuse across batches (see the toy prefix-cache sketch after this list), early-exit heuristics, and tensor-parallel streaming, to squeeze every microsecond out of LLM inference while retaining accuracy.
  • Operational Excellence - Establish SLOs/SLA dashboards, real-time observability, and self-healing mechanisms for thousands of models across multiple regions. Drive cost-performance trade-off tooling that makes it trivial for customers to choose the best HW tier for each workload.
  • Leadership - Hire, mentor, and grow a diverse team of engineers and managers passionate about large-scale AI inference. Foster a customer-obsessed, metrics-driven engineering culture with crisp design reviews and blameless post-mortems.
  • Collaboration - Partner closely with Product, Orchestration, Networking, and Security teams to deliver a unified CoreWeave experience. Engage directly with flagship customers (internal and external) to gather feedback and shape the roadmap.
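To make "KV-cache reuse across batches" concrete, below is a toy Python sketch of prefix caching. It is purely illustrative: real runtimes such as vLLM key GPU cache blocks by token blocks and hold tensors, whereas this sketch hashes string-joined token prefixes and stores placeholder strings.

    import hashlib

    # Toy prefix cache. The "KV state" here is a placeholder string, not
    # actual attention key/value tensors.
    class PrefixKVCache:
        def __init__(self):
            self.store = {}

        def _key(self, tokens):
            return hashlib.sha256(" ".join(map(str, tokens)).encode()).hexdigest()

        def lookup(self, tokens):
            # Longest cached prefix wins; linear scan for clarity only.
            for end in range(len(tokens), 0, -1):
                state = self.store.get(self._key(tokens[:end]))
                if state is not None:
                    return end, state
            return 0, None

        def insert(self, tokens, state):
            self.store[self._key(tokens)] = state

    cache = PrefixKVCache()
    system_prompt = list(range(100))           # 100 shared system-prompt tokens
    cache.insert(system_prompt, "kv-for-system-prompt")

    request = system_prompt + [101, 102, 103]  # new request sharing the prefix
    hit_len, _ = cache.lookup(request)
    print(f"reused {hit_len} of {len(request)} tokens")  # reused 100 of 103

Only the three new tokens need prefill compute in this example; at fleet scale that reuse is a first-order lever on both latency and cost.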

Benefits

  • Medical, dental, and vision insurance - 100% paid for by CoreWeave
  • Company-paid Life Insurance
  • Voluntary supplemental life insurance
  • Short and long-term disability insurance
  • Flexible Spending Account
  • Health Savings Account
  • Tuition Reimbursement
  • Ability to Participate in Employee Stock Purchase Program (ESPP)
  • Mental Wellness Benefits through Spring Health
  • Family-Forming support provided by Carrot
  • Paid Parental Leave
  • Flexible, full-service childcare support with Kinside
  • 401(k) with a generous employer match
  • Flexible PTO
  • Catered lunch each day in our office and data center locations
  • A casual work environment
  • A work culture focused on innovative disruption