Principal Engineer, Inference

CoreWeave · Sunnyvale, CA · Hybrid
$206,000 - $303,000 · Posted 78 days ago

About The Position

CoreWeave is the AI Hyperscaler, delivering a cloud platform of cutting-edge services that power the next wave of AI. Our technology provides enterprises and leading AI labs with the most performant, efficient, and resilient solutions for accelerated computing. Since 2017, CoreWeave has operated a growing footprint of data centers spanning every region of the US and across Europe, and was ranked one of the TIME100 most influential companies of 2024.

As the leader in the industry, we thrive in an environment where adaptability and resilience are key, and our culture offers career-defining opportunities for those who excel amid change and challenge. If you thrive in a dynamic environment, enjoy solving complex problems, and are eager to make a significant impact, CoreWeave is the place for you. Join us and be part of a team solving some of the most exciting challenges in the industry. CoreWeave powers the creation and delivery of the intelligence that drives innovation.

Requirements

  • 10+ years building distributed systems or HPC/cloud services, with 4+ years focused on real-time ML inference or other latency-critical data planes.
  • Demonstrated expertise in micro-batch schedulers, GPU resource isolation, KV caching, speculative decoding, and mixed precision (BF16/FP8) inference.
  • Deep knowledge of PyTorch or TensorFlow serving internals, CUDA kernels, NCCL/SHARP, RDMA, NUMA, and GPU interconnect topologies.
  • Proven track record of driving sub-50 ms global P99 latencies and optimizing cost-per-token / cost-per-request on multi-node GPU clusters.
  • Fluency with Kubernetes (or Slurm/Ray) at production scale plus CI/CD, service meshes, and observability stacks (Prometheus, Grafana, OpenTelemetry).
  • Excellent communicator who influences architecture across teams and presents complex trade-offs to executives and customers.
  • Bachelor's or Master's in CS, EE, or related field (or equivalent practical experience).

Nice To Haves

  • Code contributions to open-source inference frameworks (vLLM, Triton, Ray Serve, TensorRT-LLM, TorchServe).
  • Experience operating multi-region inference fleets or streaming-token services at a hyperscaler or AI research lab.
  • Publications/talks on latency optimization, token streaming, or advanced model-server architectures.

Responsibilities

  • Define the technical roadmap for ultra-low-latency, high-throughput inference.
  • Evaluate and influence adoption of runtimes and frameworks (Triton, vLLM, TensorRT-LLM, Ray Serve, TorchServe) and guide build-vs-buy decisions.
  • Design Kubernetes-native control-plane components that deploy, autoscale, and monitor fleets of model-server pods spanning thousands of GPUs.
  • Implement advanced optimizations: micro-batching, speculative decoding, KV-cache reuse, early-exit heuristics, tensor/stream parallel inference.
  • Build intelligent request routing and adaptive scheduling to maximize GPU utilization while guaranteeing strict P99 latency SLAs.
  • Create real-time observability, live debugging hooks, and automated rollback/traffic-shift for model versioning.
  • Develop cost-per-token and cost-per-request analytics so customers can instantly select the ideal hardware tier.
  • Write production code, reference implementations, and performance benchmarks across gRPC/HTTP, CUDA Graphs, and NCCL/SHARP fast-paths.
  • Lead deep-dive investigations into network, PCIe, NVLink, and memory-bandwidth bottlenecks.
  • Coach engineers on large-scale inference best practices and performance profiling.
  • Partner with lighthouse customers to launch and optimize mission-critical, real-time AI applications.

Benefits

  • Medical, dental, and vision insurance - 100% paid for by CoreWeave.
  • Company-paid Life Insurance.
  • Voluntary supplemental life insurance.
  • Short and long-term disability insurance.
  • Flexible Spending Account.
  • Health Savings Account.
  • Tuition Reimbursement.
  • Ability to Participate in Employee Stock Purchase Program (ESPP).
  • Mental Wellness Benefits through Spring Health.
  • Family-Forming support provided by Carrot.
  • Paid Parental Leave.
  • Flexible, full-service childcare support with Kinside.
  • 401(k) with a generous employer match.
  • Flexible PTO.
  • Catered lunch each day in our office and data center locations.
  • A casual work environment.
  • A work culture focused on innovative disruption.

What This Job Offers

Job Type

Full-time

Career Level

Senior

Industry

Professional, Scientific, and Technical Services

Education Level

Bachelor's or Master's degree

© 2024 Teal Labs, Inc