Site Reliability Engineer

Blaxel • San Francisco, CA

About The Position

We're looking for a world-class Site Reliability Engineer to ensure the reliability, performance, and scalability of our AI infrastructure platform. You'll build and operate the core systems that power agentic AI at scale. Your mission: keep our ultra-low-latency, stateful, serverless compute engine rock-solid as we serve billions of agent requests for the most sophisticated AI teams in the world.

This role is highly technical and execution-heavy. You'll own our reliability posture end-to-end: observability, performance tuning, incident operations, infrastructure health, and the automation that keeps everything running smoothly. We want you to design new reliability systems, push the boundaries of automation, and continuously evolve the platform to meet the demands of next-generation AI workloads.

If you're a builder who thrives on owning critical infrastructure at scale, this role is for you. Working closely with the founders, the infra team, and the dev team, and using AI wherever it gives you leverage, you will architect and operate the systems that keep Blaxel fast, resilient, and secure.

Requirements

3+ years in SRE, DevOps, or infrastructure engineering roles
Strong proficiency in at least one programming language such as Go, Rust, or Python
Hands-on experience with a major cloud provider (AWS, GCP)
Solid knowledge of Linux systems, networking fundamentals, and distributed systems
Experience with bare-metal servers and datacenter operations (PXE/iPXE provisioning, IPMI/BMC, RAID/NVMe, SR-IOV, high-throughput networking)
Experience with Kubernetes or similar orchestrators
Familiarity with observability stacks (Prometheus, Grafana, ELK, Datadog)
Experience building and maintaining CI/CD pipelines (GitHub Actions, GitLab CI, Jenkins)
Strong debugging, problem-solving, and incident-management skills

Nice To Haves

Experience with infrastructure-as-code tools such as Terraform or Pulumi
Knowledge of service mesh or API gateway technologies
Exposure to chaos engineering or resiliency-testing frameworks
Background in security best practices for cloud environments
Prior experience in high-growth or high-availability environments
Experience with serverless compute systems
Experience with sandboxed execution environments
Experience with ultra-low-latency runtime engineering
Experience with distributed key-value stores and databases
Systems-level programming in Rust, Go, or a similar language
Deep experience with generative AI infrastructure

Responsibilities

Architect, operate, and continuously improve the core infrastructure powering our 25 ms cold-start compute engine.
Build and evolve our observability stack (metrics, traces, logs), ensuring we detect issues before users do.
Define, monitor, and drive SLOs/SLIs across key system surfaces to maintain world-class reliability.
Lead incident response with rigor: root cause analysis, post-mortems, and systemic fixes.
Design and implement self-healing, automated operational systems to eliminate toil and scale operations.
Work across compute, networking, storage, and sandboxed execution layers to tune performance under extreme workloads.
Build automation and tooling, often with AI agents, to streamline operations, debugging, capacity planning, and failure prediction.
Stress-test and push our systems to the edge: load testing, chaos engineering, and performance benchmarking.
Own security best practices at the infrastructure layer, from sandboxed compute to network isolation.
Partner with platform engineers to ensure reliability is designed into new features from day one.