We're looking for a world-class Site Reliability Engineer to ensure the reliability, performance, and scalability of our AI infrastructure platform. You'll build and operate the core systems that power agentic AI at scale. Your mission: keep our ultra-low-latency, stateful, serverless compute engine rock-solid as we serve billions of agent requests for the most sophisticated AI teams in the world.

This role is highly technical and execution-heavy. You'll own our reliability posture end-to-end: observability, performance tuning, incident operations, infrastructure health, and the automation systems that keep everything running smoothly. We want you to design new reliability systems, push the boundaries of automation, and continuously evolve the platform to meet the demands of next-generation AI workloads.

If you're a builder who thrives on owning critical infrastructure at scale, this role is for you. Working closely with the founders, the infra team, and the dev team, and using AI wherever it adds leverage, you will architect and operate the systems that keep Blaxel fast, resilient, and secure.
Job Type: Full-time
Career Level: Senior
Education Level: No Education Listed