Founding Engineer - Site Reliability

uRun
United States, CA
$185,000 - $285,000
Remote

About The Position

uRun is building the inference cloud for interactive AI, focusing on the compute layer that enables real-time, stateful inference at scale. As a founding Site Reliability Engineer, you will be instrumental in establishing the reliability culture from the ground up. This includes defining the observability stack, incident response playbooks, SLOs, and the on-call process. You will collaborate closely with infrastructure and platform engineers to ensure the stability and performance of the inference platform.

Requirements

  • 7+ years in site reliability, production engineering, or infrastructure engineering in a high-availability, low-latency environment.
  • Deep experience owning SLOs, error budgets, and on-call processes in production at scale.
  • Strong observability background: you have built or owned monitoring stacks (Prometheus, Grafana, Datadog, or equivalent) and know what good alerting looks like.
  • Proven incident response experience: you have led real incidents under pressure and written postmortems that actually changed behaviour.
  • Hands-on with Kubernetes and cloud infrastructure (AWS preferred): you can debug a failing pod and a misconfigured VPC in the same afternoon.
  • Strong software engineering fundamentals: you write automation, not just runbooks.
  • Comfortable operating as the first and only SRE, setting standards without a template to follow.

Nice To Haves

  • Experience supporting GPU compute or ML inference infrastructure in production.
  • Familiarity with stateful workloads, long-running sessions, or streaming inference systems.
  • Exposure to multi-tenant platforms where isolation, noisy neighbour problems, and billing-aware scheduling matter.
  • Prior founding or sole SRE experience at an early-stage company.

Responsibilities

  • Define and own SLOs and error budgets across uRun's inference platform and supporting infrastructure.
  • Build and maintain the observability stack end-to-end: metrics, logging, tracing, and alerting across a distributed GPU compute environment.
  • Lead incident response: detection, triage, resolution, and blameless postmortems that drive lasting fixes.
  • Partner with ML infrastructure engineers to embed reliability into the deployment pipeline from day one.
  • Design and maintain runbooks, on-call rotations, and escalation paths as the team scales.
  • Drive capacity planning and traffic management across heterogeneous compute to protect latency and availability under load.
  • Identify and eliminate toil through automation, building systems that scale without scaling the team proportionally.

Benefits

  • Competitive salary and meaningful equity
  • Health, dental, and vision — full coverage
  • 401(k) — company-supported retirement savings
  • FSA/HSA — flexible spending accounts for healthcare costs
  • Paid time off — we trust you to manage your time
  • Top-tier tooling — access to the best AI tools available: Claude, Codex, Kimi, and whatever else helps you move faster
  • MacBook Pro and AirPods — the hardware you need, on us