Site Reliability Engineer

Runpod, Inc.
Remote

About The Position

Runpod is a foundational platform for developers to build and run custom AI systems that scale, serving over 500,000 developers worldwide with an annual recurring revenue run rate exceeding $120M. Founded in 2022, Runpod has rapidly grown by building infrastructure specifically for modern AI workloads, enabling teams to transition from experimentation to deployment across cloud, on-prem, and hybrid environments. As a remote-first, globally distributed company, Runpod is dedicated to building the infrastructure layer that powers the next generation of AI systems.

The Reliability team at Runpod is responsible for the availability, performance, and operational excellence of the global platform. This team ensures that systems remain resilient, observable, and scalable under real-world production conditions by defining and enforcing reliability standards, designing incident response processes, building observability systems and reliability tooling, driving SLO adoption, conducting production readiness reviews, and reducing operational toil through automation. The team collaborates cross-functionally with Infrastructure, Product Engineering, and Support to maintain system stability and performance during rapid scaling, valuing proactive problem-solving, automation-first thinking, and strong ownership of production systems.

As a Site Reliability Engineer on the Reliability team, you will focus on ensuring the stability and resilience of Runpod’s distributed platform. This involves partnering with engineering teams to enhance system design, strengthen observability, and proactively prevent incidents. The role combines software engineering with production operations, focusing on reliability frameworks, SLO design, automation, and production hardening to reduce errors and improve performance across various services and infrastructure. This is a high-impact role crucial for maintaining developer trust in Runpod's platform for critical AI workloads.

Requirements

  • 5+ years of experience in SRE, Reliability Engineering, or Production Engineering
  • Strong Linux systems and networking expertise
  • Experience managing containerized production systems
  • Strong understanding of distributed systems and failure modes
  • Experience defining and managing SLIs/SLOs
  • Proven incident response and postmortem leadership experience
  • Strong scripting or programming skills
  • Experience with monitoring and alerting systems
  • Excellent written communication skills
  • Successful completion of a background check

Nice To Haves

  • Experience with GPU infrastructure or AI/ML platforms
  • Experience improving reliability in high-growth or large-scale environments
  • Familiarity with GPU observability tooling
  • Experience with Infrastructure as Code
  • Experience working in startup environments
  • Experience building internal reliability platforms or frameworks

Responsibilities

  • Define and implement SLIs/SLOs for critical services
  • Lead incident response and coordinate cross-team mitigation efforts
  • Conduct blameless postmortems and ensure corrective actions are completed
  • Perform production readiness reviews for new services and features
  • Identify systemic risks and drive preventative improvements
  • Design and improve monitoring, alerting, and dashboards (Prometheus, Grafana, etc.)
  • Improve signal-to-noise ratio in alerts and reduce alert fatigue
  • Build internal tooling for reliability tracking and reporting
  • Improve visibility into GPU performance and distributed systems health
  • Automate recurring operational workflows
  • Build tools and scripts (Python, Go, Bash) to eliminate manual processes
  • Improve deployment safety through automation and guardrails
  • Strengthen CI/CD reliability and release processes
  • Partner with engineering teams to improve system resilience
  • Provide guidance on fault tolerance, scalability, and failure handling
  • Contribute to architectural discussions with a reliability-first mindset

Benefits

  • Competitive base pay: this position ranges from $150,000 to $200,000 USD
  • Meaningful equity in a fast-growing company: everyone on the team receives stock options
  • Generous medical, dental & vision plans
  • Flexible PTO: take the time you need to recharge
  • Most roles are remote-first, with inclusive, collaborative teams that use Slack as the main form of internal communication