Founding ML Infrastructure Engineer

uRun • United States, CA

6d • $200,000 - $350,000

About The Position

uRun is building the next generation of AI inference infrastructure, focusing on the compute layer that makes real-time, stateful inference possible at scale. As a founding ML Infrastructure and Platform Engineer, you will own the architecture and scaling of our GPU compute platform from the ground up. This is a founding technical hire with end-to-end ownership across the full infrastructure stack, from bare metal to model serving. You will work directly with the founding team and define how we build.

Requirements

Proven experience designing and operating large-scale distributed infrastructure at 1,000+ nodes or equivalent complexity, in any domain
Deep expertise in distributed systems, cluster orchestration (Kubernetes, Slurm, or custom schedulers), and large-scale resource scheduling
Strong production reliability instincts: observability, incident response, capacity planning, and SLA ownership across complex systems
Experience building infrastructure that other engineers build on top of, not just operating it
Ability to operate as a technical lead: set direction, make tradeoffs under uncertainty, and raise the bar for the team around you
Startup orientation. You are energised by ambiguity, move fast, and build for scale from day one

Nice To Haves

Exposure to ML infrastructure concepts: GPU networking (NCCL, InfiniBand, RoCE), model serving frameworks (vLLM, SGLang, TensorRT-LLM), or hardware-aware performance tuning (CuTe, Triton, TileLang)
Experience with multi-cloud GPU procurement and capacity management across AWS, GCP, Azure, and bare metal providers
Familiarity with inference marketplace architectures, dynamic routing, or spot/preemptible workload management
Prior experience at a Series A or earlier stage company scaling from early infrastructure to production

Responsibilities

Design and scale our GPU compute platform to support 1,000+ GPU clusters, ensuring high availability and low-latency inference across the fleet
Build and maintain the infrastructure layer for our compute marketplace, including multi-tenant scheduling, isolation, and billing-aware resource allocation
Own production reliability for ML systems end-to-end: observability, incident response, and SLA achievement across model serving and infrastructure
Architect feature stores and model registry systems that support rapid iteration and reproducibility at scale
Design experiment-tracking infrastructure capable of handling thousands of concurrent runs with full auditability
Build resource orchestration and scheduling systems that optimise for throughput, cost, and latency across heterogeneous hardware
Set engineering standards for infrastructure reliability, capacity planning, and operational excellence as an early technical leader

Benefits

Competitive salary and meaningful equity
Health, dental, and vision — full coverage
401(k) — company-supported retirement savings
FSA/HSA — flexible spending accounts for healthcare costs
Paid time off — we trust you to manage your time
Top-tier tooling — access to the best AI tools available: Claude, Codex, Kimi, and whatever else helps you move faster
MacBook Pro and AirPods — the hardware you need, on us
