Founding Engineer - ML Performance

uRun • United States, CA

4d • $250,000 - $395,000

About The Position

The problem we saw: Most AI infrastructure is built for batch: send a query, wait, get a response, reset. Powerful, but transactional. AI is becoming interactive: sessions that hold state, models that stay alive between turns, generation that responds as it runs. The infrastructure to deliver that at scale doesn't really exist yet. The bottleneck isn't the models anymore. It's the infrastructure underneath them.

What we're building to fix it: uRun is the inference cloud for interactive AI: the compute layer that makes real-time, stateful inference possible at scale. We came out of stealth in April 2026, are backed by top-tier investors, and are founded by Keegan McCallum, who scaled inference infrastructure for some of the most demanding generative AI workloads in production. We're an infrastructure company. We build the layer that model labs, builders, and research teams ship on top of.

Requirements

Deep, hands-on CUDA expertise: you have written custom kernels in production, not just called into cuBLAS
Strong background in model inference and post-training optimization at scale
Fluency in GPU memory hierarchy, warp scheduling, kernel fusion, and hardware-aware algorithm design
Experience profiling and benchmarking complex inference pipelines: you know where the time goes and how to get it back
Able to operate at the frontier with minimal guidance — you identify the problem, design the approach, and ship the fix

Nice To Haves

Public work in GPU optimization or inference efficiency — open source contributions, a published paper, or a side project that shows your depth (vLLM, Flash-Attention, TensorRT-LLM, PyTorch, or equivalent)
Experience with hardware-aware optimization frameworks: CuTe, Triton, TileLang, or similar
Familiarity with distributed memory and communication primitives: NCCL, InfiniBand, NVLink, RoCE
Contributions to or deep familiarity with PyTorch Distributed, Ray core, or similar systems
Experience optimizing for video generation or other high-throughput, latency-sensitive generative workloads
Prior work at an inference-focused company or research lab pushing the boundary of what GPU hardware can do

Responsibilities

Write custom CUDA kernels that unlock performance headroom unavailable through off-the-shelf frameworks
Optimize model inference end-to-end, targeting sub-50ms latency across our inference platform
Drive 10x performance improvements across the stack: memory bandwidth, kernel fusion, operator scheduling, and beyond
Implement zero-copy distributed memory optimizations across multi-GPU and multi-node environments
Own GPU utilization and memory management, squeezing every available FLOP out of the hardware we run
Profile, benchmark, and instrument the full inference pipeline to find and eliminate bottlenecks systematically
Set the performance engineering bar for the team: define what fast looks like and build the tooling to measure it

Benefits

Competitive salary and meaningful equity in an early-stage AI infrastructure company
Health, dental, and vision — full coverage
401(k) — company-supported retirement savings
FSA/HSA — flexible spending accounts for healthcare costs
Paid time off — we trust you to manage your time
Top-tier tooling — access to the best AI tools available: Claude, Codex, Kimi, and whatever else helps you move faster
MacBook Pro and AirPods — the hardware you need, on us
