Principal Engineer, Inference

CoreWeave | Sunnyvale, CA

$206,000 - $303,000 | Hybrid

About The Position

We’re seeking a Principal Engineer to serve as the hands-on technical leader for our next-generation Inference Platform. As a senior individual contributor, you will architect and build the fastest, most cost-effective, and most reliable GPU inference services in the industry. You’ll prototype new capabilities, drive engineering standards, and work shoulder-to-shoulder with engineering, product, orchestration, and hardware teams to make CoreWeave the best place on earth to serve frontier models in production. The role spans technical vision and strategy, platform architecture, operational excellence, hands-on development, and mentorship and collaboration; the specific duties are listed under Responsibilities below.

Requirements

  • 10+ years building distributed systems or HPC/cloud services, with 4+ years focused on real-time ML inference or other latency-critical data planes.
  • Demonstrated expertise in micro-batch schedulers, GPU resource isolation, KV caching, speculative decoding, and mixed precision (BF16/FP8) inference.
  • Deep knowledge of PyTorch or TensorFlow serving internals, CUDA kernels, NCCL/SHARP, RDMA, NUMA, and GPU interconnect topologies.
  • Proven track record of driving sub-50 ms global P99 latencies and optimizing cost-per-token / cost-per-request on multi-node GPU clusters.
  • Fluency with Kubernetes (or Slurm/Ray) at production scale plus CI/CD, service meshes, and observability stacks (Prometheus, Grafana, OpenTelemetry).
  • Excellent communicator who influences architecture across teams and presents complex trade-offs to executives and customers.
  • Bachelor’s or Master’s in CS, EE, or related field (or equivalent practical experience).

Nice To Haves

  • Code contributions to open-source inference frameworks (vLLM, Triton, Ray Serve, TensorRT-LLM, TorchServe).
  • Experience operating multi-region inference fleets or streaming-token services at a hyperscaler or AI research lab.
  • Publications/talks on latency optimization, token streaming, or advanced model-server architectures.

Responsibilities

  • Define the technical roadmap for ultra-low-latency, high-throughput inference.
  • Evaluate and influence adoption of runtimes and frameworks (Triton, vLLM, TensorRT-LLM, Ray Serve, TorchServe) and guide build-vs-buy decisions.
  • Design Kubernetes-native control-plane components that deploy, autoscale, and monitor fleets of model-server pods spanning thousands of GPUs.
  • Implement advanced optimizations (micro-batching, speculative decoding, KV-cache reuse, early-exit heuristics, tensor/stream-parallel inference) to squeeze every microsecond out of large-model serving; a minimal batching sketch follows this list.
  • Build intelligent request routing and adaptive scheduling to maximize GPU utilization while guaranteeing strict P99 latency SLAs.
  • Create real-time observability, live debugging hooks, and automated rollback/traffic-shift for model versioning.
  • Develop cost-per-token and cost-per-request analytics so customers can instantly select the ideal hardware tier; a worked cost example follows this list.
  • Write production code, reference implementations, and performance benchmarks across gRPC/HTTP, CUDA Graphs, and NCCL/SHARP fast-paths.
  • Lead deep-dive investigations into network, PCIe, NVLink, and memory-bandwidth bottlenecks.
  • Coach engineers on large-scale inference best practices and performance profiling.
  • Partner with lighthouse customers to launch and optimize mission-critical, real-time AI applications.
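For context on the micro-batching item above, here is a minimal, illustrative sketch (all names and parameters are hypothetical, not CoreWeave serving code): requests arriving within a short window are grouped into one batch so the GPU runs a single larger forward pass instead of many tiny ones, trading a few milliseconds of queueing delay for much higher utilization.

```python
# Minimal micro-batching sketch (illustrative only; names and defaults are hypothetical).
import asyncio


class MicroBatcher:
    def __init__(self, max_batch_size: int = 8, max_wait_ms: float = 5.0):
        self.max_batch_size = max_batch_size
        self.max_wait = max_wait_ms / 1000.0
        self.queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, prompt: str) -> str:
        """Enqueue a request and wait for its result."""
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((prompt, fut))
        return await fut

    async def run(self) -> None:
        """Collect requests into batches bounded by size and wait time."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]            # block for the first request
            deadline = loop.time() + self.max_wait
            while len(batch) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            # Placeholder for the real batched model call (one forward pass per batch).
            for prompt, fut in batch:
                fut.set_result(f"generated<{prompt}>")


async def demo() -> None:
    batcher = MicroBatcher()
    worker = asyncio.create_task(batcher.run())
    replies = await asyncio.gather(*(batcher.submit(f"prompt-{i}") for i in range(3)))
    print(replies)
    worker.cancel()


asyncio.run(demo())
```

In a production stack the placeholder model call would be a batched forward pass in a runtime such as vLLM or TensorRT-LLM, and the batch-size and wait-time knobs would be tuned against the P99 latency SLA.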
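The cost-per-token analytics item above reduces to simple arithmetic. The sketch below uses made-up hourly rates and throughput figures purely for illustration (not CoreWeave pricing).

```python
# Back-of-the-envelope cost-per-token arithmetic (illustrative assumptions only).
def cost_per_million_tokens(gpu_hourly_rate: float,
                            num_gpus: int,
                            tokens_per_second: float) -> float:
    """Dollars per one million generated tokens for a given serving setup."""
    dollars_per_second = (gpu_hourly_rate * num_gpus) / 3600.0
    return dollars_per_second / tokens_per_second * 1_000_000


# Example: 8 GPUs at a hypothetical $4.25/GPU-hour sustaining 12,000 tokens/s
# works out to roughly $0.79 per million tokens.
print(round(cost_per_million_tokens(4.25, 8, 12_000), 2))
```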

Benefits

  • Medical, dental, and vision insurance - 100% paid for by CoreWeave
  • Company-paid Life Insurance
  • Voluntary supplemental life insurance
  • Short and long-term disability insurance
  • Flexible Spending Account
  • Health Savings Account
  • Tuition Reimbursement
  • Ability to Participate in Employee Stock Purchase Program (ESPP)
  • Mental Wellness Benefits through Spring Health
  • Family-Forming support provided by Carrot
  • Paid Parental Leave
  • Flexible, full-service childcare support with Kinside
  • 401(k) with a generous employer match
  • Flexible PTO
  • Catered lunch each day in our office and data center locations
  • A casual work environment
  • A work culture focused on innovative disruption