Staff Software Engineer, Inference

CoreWeave · Sunnyvale, CA
Hybrid

About The Position

The Inference team builds and operates CoreWeave’s Kubernetes-native inference platform, powering low-latency, high-throughput AI workloads at massive scale. The team is responsible for request routing, scheduling, GPU resource management, and system-wide optimizations that drive performance, efficiency, and reliability across real-time inference systems.

As a Staff Software Engineer (IC5) on the Inference team, you will act as a technical leader driving architecture, performance, and reliability across multiple services and teams. Your day-to-day will involve leading cross-team design initiatives, optimizing inference performance (latency, throughput, and GPU utilization), and improving system reliability at scale. You will work deeply in distributed systems and Kubernetes-based infrastructure, focusing on areas like scheduling, batching, and memory optimization. This role requires hands-on technical leadership and the ability to influence engineering direction across the organization.
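For a purely illustrative sense of the batching work described above, here is a minimal Python sketch of a dynamic micro-batching loop: requests are grouped until either a batch-size cap or a small latency budget is reached, then dispatched together, trading a few milliseconds of per-request wait for higher GPU throughput. Every name and parameter in it (MAX_BATCH, MAX_WAIT_MS, run_model) is a hypothetical placeholder, not a description of CoreWeave’s platform.

    import queue
    import threading
    import time

    MAX_BATCH = 16       # hypothetical batch-size cap
    MAX_WAIT_MS = 5.0    # hypothetical per-batch latency budget

    requests: "queue.Queue[str]" = queue.Queue()

    def run_model(batch):
        # Placeholder for the actual GPU inference call.
        print(f"running a batch of {len(batch)} requests")

    def batching_loop():
        while True:
            batch = [requests.get()]  # block until the first request arrives
            deadline = time.monotonic() + MAX_WAIT_MS / 1000.0
            while len(batch) < MAX_BATCH:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(requests.get(timeout=remaining))
                except queue.Empty:
                    break
            run_model(batch)

    if __name__ == "__main__":
        threading.Thread(target=batching_loop, daemon=True).start()
        for i in range(40):
            requests.put(f"req-{i}")
            time.sleep(0.001)
        time.sleep(0.1)  # give the batcher time to drain the queue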

Requirements

  • 8–12+ years of experience building and operating large-scale distributed systems or cloud platforms
  • Proven experience leading cross-team technical initiatives impacting multiple services or organizations
  • Strong programming skills in Go, Python, or C++
  • Deep expertise in Kubernetes at production scale, including orchestration, scheduling, and service design
  • Strong understanding of distributed systems, networking, and performance optimization
  • Experience designing and operating low-latency, high-throughput systems with strict P95/P99 latency requirements (see the illustrative latency-percentile sketch after this list)
  • Hands-on experience with inference systems, including batching or micro-batching strategies, caching, and memory optimization
  • Experience improving system performance using metrics-driven approaches (e.g., latency, throughput, utilization)
  • Familiarity with mixed precision (BF16, FP8) and streaming inference workloads
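As an equally illustrative aside on the P95/P99 and metrics-driven requirements above, the short Python sketch below computes tail-latency percentiles from recorded per-request latencies, the kind of signal typically used to judge whether a batching or caching change actually helped. The function name and the simulated samples are hypothetical, not part of any CoreWeave tooling.

    import random
    import statistics

    def tail_latencies(samples_ms):
        """Return (p50, p95, p99) from per-request latencies in milliseconds."""
        # statistics.quantiles with n=100 yields the 1st..99th percentile cut points.
        pcts = statistics.quantiles(samples_ms, n=100)
        return pcts[49], pcts[94], pcts[98]

    if __name__ == "__main__":
        # Simulated latency samples; a real service would record these per request.
        samples = [random.lognormvariate(3.0, 0.4) for _ in range(10_000)]
        p50, p95, p99 = tail_latencies(samples)
        print(f"p50={p50:.1f} ms  p95={p95:.1f} ms  p99={p99:.1f} ms")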

Nice To Haves

  • Experience with inference frameworks such as vLLM, Triton, TensorRT-LLM, Ray Serve, or TorchServe
  • Experience with GPU systems and performance optimization (CUDA, NCCL, RDMA, NUMA, GPU interconnects)
  • Experience leading multi-team or org-level technical initiatives
  • Exposure to large-scale AI/ML infrastructure or hyperscale cloud environments

Responsibilities

  • Act as a technical leader driving architecture, performance, and reliability across multiple services and teams.
  • Lead cross-team design initiatives.
  • Optimize inference performance (latency, throughput, and GPU utilization).
  • Improve system reliability at scale.
  • Work deeply in distributed systems and Kubernetes-based infrastructure, focusing on areas like scheduling, batching, and memory optimization.

Benefits

  • Medical, dental, and vision insurance - 100% paid for by CoreWeave
  • Company-paid Life Insurance
  • Voluntary supplemental life insurance
  • Short- and long-term disability insurance
  • Flexible Spending Account
  • Health Savings Account
  • Tuition Reimbursement
  • Ability to Participate in Employee Stock Purchase Program (ESPP)
  • Mental Wellness Benefits through Spring Health
  • Family-Forming support provided by Carrot
  • Paid Parental Leave
  • Flexible, full-service childcare support with Kinside
  • 401(k) with a generous employer match
  • Flexible PTO
  • Catered lunch each day in our office and data center locations
  • A casual work environment
  • A work culture focused on innovative disruption