Runpod is a foundational platform for developers to build and run custom AI systems that scale, serving over 500,000 developers worldwide with an annual recurring revenue run rate exceeding $120M. Founded in 2022, Runpod has rapidly grown by building infrastructure specifically for modern AI workloads, enabling teams to transition from experimentation to deployment across cloud, on-prem, and hybrid environments. As a remote-first, globally distributed company, Runpod is dedicated to building the infrastructure layer that powers the next generation of AI systems.

The Reliability team at Runpod is responsible for the availability, performance, and operational excellence of the global platform. This team ensures that systems remain resilient, observable, and scalable under real-world production conditions by defining and enforcing reliability standards, designing incident response processes, building observability systems and reliability tooling, driving SLO adoption, conducting production readiness reviews, and reducing operational toil through automation. The team collaborates cross-functionally with Infrastructure, Product Engineering, and Support to maintain system stability and performance during rapid scaling, valuing proactive problem-solving, automation-first thinking, and strong ownership of production systems.

As a Site Reliability Engineer on the Reliability team, you will focus on ensuring the stability and resilience of Runpod's distributed platform. This involves partnering with engineering teams to enhance system design, strengthen observability, and proactively prevent incidents. The role combines software engineering with production operations, focusing on reliability frameworks, SLO design, automation, and production hardening to reduce errors and improve performance across various services and infrastructure. This is a high-impact role crucial for maintaining developer trust in Runpod's platform for critical AI workloads.
Job Type
Full-time
Career Level
Senior
Education Level
No Education Listed