Site Reliability Engineer

Runpod, Inc.

Remote

About The Position

Runpod is a foundational platform for developers to build and run custom AI systems that scale, serving over 500,000 developers worldwide with an annual recurring revenue run rate exceeding $120M. Founded in 2022, Runpod has rapidly grown by building infrastructure specifically for modern AI workloads, enabling teams to transition from experimentation to deployment across cloud, on-prem, and hybrid environments. As a remote-first, globally distributed company, Runpod is dedicated to building the infrastructure layer that powers the next generation of AI systems.

The Reliability team at Runpod is responsible for the availability, performance, and operational excellence of the global platform. This team ensures that systems remain resilient, observable, and scalable under real-world production conditions by defining and enforcing reliability standards, designing incident response processes, building observability systems and reliability tooling, driving SLO adoption, conducting production readiness reviews, and reducing operational toil through automation. The team collaborates cross-functionally with Infrastructure, Product Engineering, and Support to maintain system stability and performance during rapid scaling, valuing proactive problem-solving, automation-first thinking, and strong ownership of production systems.

As a Site Reliability Engineer on the Reliability team, you will focus on ensuring the stability and resilience of Runpod's distributed platform. This involves partnering with engineering teams to enhance system design, strengthen observability, and proactively prevent incidents. The role combines software engineering with production operations, focusing on reliability frameworks, SLO design, automation, and production hardening to reduce errors and improve performance across various services and infrastructure. This is a high-impact role crucial for maintaining developer trust in Runpod's platform for critical AI workloads.

Requirements

5+ years of experience in SRE, Reliability Engineering, or Production Engineering
Strong Linux systems and Networking expertise
Experience managing containerized production systems
Strong understanding of distributed systems and failure modes
Experience defining and managing SLIs/SLOs
Proven incident response and postmortem leadership experience
Strong scripting or programming skills
Experience with monitoring and alerting systems
Excellent written communication skills
Successful completion of a background check

Nice To Haves

Experience with GPU infrastructure or AI/ML platforms
Experience improving reliability in high-growth or large-scale environments
Familiarity with GPU observability tooling
Experience with Infrastructure as Code
Experience working in startup environments
Experience building internal reliability platforms or frameworks

Responsibilities

Define and implement SLIs/SLOs for critical services
Lead incident response and coordinate cross-team mitigation efforts
Conduct blameless postmortems and ensure corrective actions are completed
Perform production readiness reviews for new services and features
Identify systemic risks and drive preventative improvements
Design and improve monitoring, alerting, and dashboards (Prometheus, Grafana, etc.)
Improve signal-to-noise ratio in alerts and reduce alert fatigue
Build internal tooling for reliability tracking and reporting
Improve visibility into GPU performance and distributed systems health
Automate recurring operational workflows
Build tools and scripts (Python, Go, Bash) to eliminate manual processes
Improve deployment safety through automation and guardrails
Strengthen CI/CD reliability and release processes
Partner with engineering teams to improve system resilience
Provide guidance on fault tolerance, scalability, and failure handling
Contribute to architectural discussions with a reliability-first mindset

Benefits

Competitive base pay for this position ranges from $150,000 to $200,000 USD
Meaningful equity in a fast-growing company: everyone on the team receives stock options
Generous medical, dental & vision plans
Flexible PTO: take the time you need to recharge
Most roles are remote-first, with inclusive, collaborative teams that use Slack as the main form of internal communication
