Founding Engineer - Site Reliability

uRun•United States, CA

$185,000 - $285,000•Remote

About The Position

uRun is building the inference cloud for interactive AI, focusing on the compute layer that enables real-time, stateful inference at scale. As a founding Site Reliability Engineer, you will establish the reliability culture from the ground up: defining the observability stack, incident response playbooks, SLOs, and the on-call process. You will work closely with infrastructure and platform engineers to keep the inference platform stable and performant.

Requirements

7+ years in site reliability, production engineering, or infrastructure engineering in a high-availability, low-latency environment.
Deep experience owning SLOs, error budgets, and on-call processes in production at scale.
Strong observability background: you have built or owned monitoring stacks (Prometheus, Grafana, Datadog, or equivalent) and know what good alerting looks like.
Proven incident response experience: you have led real incidents under pressure and written postmortems that actually changed behaviour.
Hands-on with Kubernetes and cloud infrastructure (AWS preferred): you can debug a failing pod and a misconfigured VPC in the same afternoon.
Strong software engineering fundamentals: you write automation, not just runbooks.
Comfortable operating as the first and only SRE, setting standards without a template to follow.

Nice To Haves

Experience supporting GPU compute or ML inference infrastructure in production.
Familiarity with stateful workloads, long-running sessions, or streaming inference systems.
Exposure to multi-tenant platforms where isolation, noisy neighbour problems, and billing-aware scheduling matter.
Prior founding or sole SRE experience at an early-stage company.

Responsibilities

Define and own SLOs and error budgets across uRun's inference platform and supporting infrastructure.
Build and maintain the observability stack end-to-end: metrics, logging, tracing, and alerting across a distributed GPU compute environment.
Lead incident response: detection, triage, resolution, and blameless postmortems that drive lasting fixes.
Partner with ML infrastructure engineers to embed reliability into the deployment pipeline from day one.
Design and maintain runbooks, on-call rotations, and escalation paths as the team scales.
Drive capacity planning and traffic management across heterogeneous compute to protect latency and availability under load.
Identify and eliminate toil through automation, building systems that scale without scaling the team proportionally.

Benefits

Competitive salary and meaningful equity
Health, dental, and vision — full coverage
401(k) — company-supported retirement savings
FSA/HSA — flexible spending accounts for healthcare costs
Paid time off — we trust you to manage your time
Top-tier tooling — access to the best AI tools available: Claude, Codex, Kimi, and whatever else helps you move faster
MacBook Pro and AirPods — the hardware you need, on us
