Head of Infrastructure

Hyperbolic Labs · San Francisco, CA

About The Position

We are hiring a Head of Infrastructure to lead the design, evolution, and reliability of Hyperbolic’s globally distributed GPU cloud. This role sits at the center of our mission: you will architect and scale the systems that power our peer-to-peer GPU marketplace, inference fabric, and core platform primitives. You’ll own the infrastructure roadmap end-to-end, from distributed systems design and resource orchestration to networking, security, and global capacity strategy. You’ll grow and mentor a world-class engineering organization, establish engineering excellence standards, and partner closely with Product, Security, Platform, and GTM leadership to turn the demands of future AI workloads into working infrastructure.

Requirements

  • 10+ years in infrastructure, systems engineering, or distributed systems, including 5+ years leading managers and senior ICs.
  • Proven ability to own multi-year infrastructure roadmaps, align stakeholders, and translate ambiguous requirements into crisp technical direction.
  • Experience building, scaling, and mentoring high-performing engineering orgs across infrastructure, platform, and SRE disciplines.
  • Exceptional judgment in balancing velocity with reliability, cost, and security.
  • Comfortable working in fast-moving, high-stakes environments where infrastructure is the product.
  • Deep expertise in distributed systems, operating systems internals, networking, and resource orchestration.
  • Hands-on experience with container orchestration and workload scheduling systems (Kubernetes, Nomad, SLURM, custom schedulers) at global scale.
  • Strong engineering background with the ability to read and write production code (Go, Rust, Python, or similar).
  • Experience architecting multi-cloud, on-prem, and edge topologies, including for GPU-centric workloads.
  • Expert-level understanding of infrastructure-as-code, automation frameworks, and GitOps workflows.
  • Expertise in designing observability systems (metrics, tracing, logging, alerting) and building operational excellence.
  • A track record of owning 99.9–99.99% uptime targets, incident response processes, and resilience engineering.
  • Passionate about security-first infrastructure, including workload isolation, network security, IAM, hardening, and compliance.
  • Experience leading major capacity planning, load forecasting, and cost optimization initiatives.

Nice To Haves

  • Contributions to open-source infra tools, kernels, schedulers, or distributed systems libraries.
  • Familiarity with service mesh, mTLS, RPC frameworks, or low-latency communication patterns.

Benefits

  • Equity
  • Health
  • Remote policy
  • Hardware budget
  • Offsites