We're looking for a world-class Site Reliability Engineer to ensure the reliability, performance, and scalability of our AI infrastructure platform. You'll build and operate the core systems that power agentic AI at scale. Your mission: keep our ultra-low-latency, stateful, serverless compute engine rock-solid as we serve billions of agent requests for the most sophisticated AI teams in the world.

This role is highly technical and execution-heavy. You'll own our reliability posture end-to-end: observability, performance tuning, incident ops, infrastructure health, and the automation systems that keep everything running smoothly. We want you to design new reliability systems, push the boundaries of automation, and continuously evolve the platform to meet the demands of next-generation AI workloads.

If you're a builder who thrives on owning critical infrastructure at scale, this role is for you. Collaborating closely with the founders, the infra team, and the dev team, and using AI wherever it creates leverage, you will architect and operate the systems that keep Blaxel fast, resilient, and secure.

Blaxel is AWS for AI agents. We're a new kind of cloud computing infrastructure optimized for the unique demands of agentic AI, powered by a purpose-built 25ms cold-start serverless compute engine. Now processing billions of agent requests, we power the infrastructure behind coding agents and background AI tasks for top AI startups. Founders choose us when they hit the limits of general-purpose clouds. We solve the hard infrastructure problems (statefulness, ultra-low latency, and secure sandboxed code execution) so they can focus on building their core AI products. We raised a $7.3M seed round led by First Round Capital.
Job Type
Full-time
Career Level
Mid Level
Education Level
No Education Listed