Principal Engineer, Inference

CoreWeave | Sunnyvale, CA

$206,000 - $303,000 | Hybrid

About The Position

We’re seeking a Principal Engineer to serve as the hands-on technical leader for our next-generation Inference Platform. As a senior individual contributor, you will architect and build the fastest, most cost-effective, and most reliable GPU inference services in the industry. You’ll prototype new capabilities, drive engineering standards, and work shoulder-to-shoulder with engineering, product, orchestration, and hardware teams to make CoreWeave the best place on earth to serve frontier models in production. The role spans technical vision and strategy, platform architecture, operational excellence, hands-on development, and mentorship and collaboration; the specific duties are listed under Responsibilities below.

Requirements

  • 10+ years building distributed systems or HPC/cloud services, with 4+ years focused on real-time ML inference or other latency-critical data planes.
  • Demonstrated expertise in micro-batch schedulers, GPU resource isolation, KV caching, speculative decoding, and mixed precision (BF16/FP8) inference.
  • Deep knowledge of PyTorch or TensorFlow serving internals, CUDA kernels, NCCL/SHARP, RDMA, NUMA, and GPU interconnect topologies.
  • Proven track record of driving sub-50 ms global P99 latencies and optimizing cost-per-token / cost-per-request on multi-node GPU clusters.
  • Fluency with Kubernetes (or Slurm/Ray) at production scale plus CI/CD, service meshes, and observability stacks (Prometheus, Grafana, OpenTelemetry).
  • Excellent communicator who influences architecture across teams and presents complex trade-offs to executives and customers.
  • Bachelor’s or Master’s in CS, EE, or related field (or equivalent practical experience).

Nice To Haves

  • Code contributions to open-source inference frameworks (vLLM, Triton, Ray Serve, TensorRT-LLM, TorchServe).
  • Experience operating multi-region inference fleets or streaming-token services at a hyperscaler or AI research lab.
  • Publications/talks on latency optimization, token streaming, or advanced model-server architectures.

Responsibilities

  • Define the technical roadmap for ultra-low-latency, high-throughput inference.
  • Evaluate and influence adoption of runtimes and frameworks (Triton, vLLM, TensorRT-LLM, Ray Serve, TorchServe) and guide build-vs-buy decisions.
  • Design Kubernetes-native control-plane components that deploy, autoscale, and monitor fleets of model-server pods spanning thousands of GPUs.
  • Implement advanced optimizations (micro-batching, speculative decoding, KV-cache reuse, early-exit heuristics, tensor/stream-parallel inference) to squeeze every microsecond out of large-model serving; a minimal batching sketch follows this list.
  • Build intelligent request routing and adaptive scheduling to maximize GPU utilization while guaranteeing strict P99 latency SLAs.
  • Create real-time observability, live debugging hooks, and automated rollback/traffic-shift for model versioning.
  • Develop cost-per-token and cost-per-request analytics so customers can instantly select the ideal hardware tier; a worked cost example follows this list.
  • Write production code, reference implementations, and performance benchmarks across gRPC/HTTP, CUDA Graphs, and NCCL/SHARP fast-paths.
  • Lead deep-dive investigations into network, PCIe, NVLink, and memory-bandwidth bottlenecks.
  • Coach engineers on large-scale inference best practices and performance profiling.
  • Partner with lighthouse customers to launch and optimize mission-critical, real-time AI applications.
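For context on the micro-batching item above, here is a minimal, illustrative sketch (all names and parameters are hypothetical, not CoreWeave serving code): requests arriving within a short window are grouped into one batch so the GPU runs a single larger forward pass instead of many tiny ones, trading a few milliseconds of queueing delay for much higher utilization.

```python
# Minimal micro-batching sketch (illustrative only; names and defaults are hypothetical).
import asyncio


class MicroBatcher:
    def __init__(self, max_batch_size: int = 8, max_wait_ms: float = 5.0):
        self.max_batch_size = max_batch_size
        self.max_wait = max_wait_ms / 1000.0
        self.queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, prompt: str) -> str:
        """Enqueue a request and wait for its result."""
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((prompt, fut))
        return await fut

    async def run(self) -> None:
        """Collect requests into batches bounded by size and wait time."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]            # block for the first request
            deadline = loop.time() + self.max_wait
            while len(batch) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            # Placeholder for the real batched model call (one forward pass per batch).
            for prompt, fut in batch:
                fut.set_result(f"generated<{prompt}>")


async def demo() -> None:
    batcher = MicroBatcher()
    worker = asyncio.create_task(batcher.run())
    replies = await asyncio.gather(*(batcher.submit(f"prompt-{i}") for i in range(3)))
    print(replies)
    worker.cancel()


asyncio.run(demo())
```

In a production stack the placeholder model call would be a batched forward pass in a runtime such as vLLM or TensorRT-LLM, and the batch-size and wait-time knobs would be tuned against the P99 latency SLA.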
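The cost-per-token analytics item above reduces to simple arithmetic. The sketch below uses made-up hourly rates and throughput figures purely for illustration (not CoreWeave pricing).

```python
# Back-of-the-envelope cost-per-token arithmetic (illustrative assumptions only).
def cost_per_million_tokens(gpu_hourly_rate: float,
                            num_gpus: int,
                            tokens_per_second: float) -> float:
    """Dollars per one million generated tokens for a given serving setup."""
    dollars_per_second = (gpu_hourly_rate * num_gpus) / 3600.0
    return dollars_per_second / tokens_per_second * 1_000_000


# Example: 8 GPUs at a hypothetical $4.25/GPU-hour sustaining 12,000 tokens/s
# works out to roughly $0.79 per million tokens.
print(round(cost_per_million_tokens(4.25, 8, 12_000), 2))
```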

Benefits

  • Medical, dental, and vision insurance - 100% paid for by CoreWeave
  • Company-paid Life Insurance
  • Voluntary supplemental life insurance
  • Short and long-term disability insurance
  • Flexible Spending Account
  • Health Savings Account
  • Tuition Reimbursement
  • Ability to Participate in Employee Stock Purchase Program (ESPP)
  • Mental Wellness Benefits through Spring Health
  • Family-Forming support provided by Carrot
  • Paid Parental Leave
  • Flexible, full-service childcare support with Kinside
  • 401(k) with a generous employer match
  • Flexible PTO
  • Catered lunch each day in our office and data center locations
  • A casual work environment
  • A work culture focused on innovative disruption