Head of Infrastructure

Hyperbolic Labs • San Francisco, CA

About The Position

We are hiring a Head of Infrastructure to lead the design, evolution, and reliability of Hyperbolic’s globally distributed GPU cloud. This role sits at the center of our mission: you will architect and scale the systems that power our peer-to-peer GPU marketplace, inference fabric, and core platform primitives. You’ll own the infrastructure roadmap end-to-end—from distributed systems design and resource orchestration to networking, security, and global capacity strategy. You’ll grow and mentor a world-class engineering organization, establish engineering excellence standards, and partner closely with Product, Security, Platform, and GTM leadership to translate future AI workloads into infrastructure reality.

Requirements

10+ years in infrastructure, systems engineering, or distributed systems, including 5+ years leading managers and senior ICs.
Proven ability to own multi-year infrastructure roadmaps, align stakeholders, and translate ambiguous requirements into crisp technical direction.
Experience building, scaling, and mentoring high-performing engineering orgs across infrastructure, platform, and SRE disciplines.
Exceptional judgment in balancing velocity with reliability, cost, and security.
Comfortable working in fast-moving, high-stakes environments where infrastructure is the product.
Deep expertise in distributed systems, operating systems internals, networking, and resource orchestration.
Hands-on experience with container orchestration and workload scheduling systems (Kubernetes, Nomad, SLURM, custom schedulers) at global scale.
Strong engineering background with the ability to read and write production code (Go, Rust, Python, or similar).
Experience architecting multi-cloud + on-prem + edge topologies, including GPU-centric workloads.
Expert-level understanding of infrastructure-as-code, automation frameworks, and GitOps workflows.
Expertise in designing observability systems (metrics, tracing, logging, alerting) and building operational excellence.
A track record of owning 99.9–99.99% uptime targets, incident response processes, and resilience engineering.
Passionate about security-first infrastructure, including workload isolation, network security, IAM, hardening, and compliance.
Experience leading major capacity planning, load forecasting, and cost optimization initiatives.

Nice To Haves

Contributions to open-source infra tools, kernels, schedulers, or distributed systems libraries.
Familiarity with service mesh, mTLS, RPC frameworks, or low-latency communication patterns.