Staff Site Reliability Engineer

Jobgether
3d · $150,000 - $225,000 · Remote

About The Position

This role offers the opportunity to lead and shape the reliability, scalability, and performance of large-scale SaaS systems. You will work in a high-impact, hands-on environment where your decisions directly influence uptime, operational excellence, and engineering efficiency. As a Staff SRE, you will drive best practices in reliability, observability, and incident management while mentoring other engineers. You’ll design and implement tools and frameworks that enable teams to own the reliability of their services, champion an SRE culture across engineering, and lead critical incidents with precision. This position is ideal for someone passionate about building mission-critical systems and scaling engineering operations globally.

Requirements

  • 8+ years of experience in Site Reliability Engineering, DevOps, or related roles, including 3+ years in a Senior or higher SRE capacity.
  • Hands-on experience running production SaaS systems at scale with a focus on reliability and uptime.
  • Proficiency in at least one programming or scripting language (Python, Go, or similar).
  • Strong experience with cloud platforms (AWS, GCP, or Azure) and container orchestration (Kubernetes).
  • Deep understanding of networking fundamentals (TCP/IP, DNS, HTTP/S, load balancing).
  • Expertise in monitoring, alerting, and observability tools (Prometheus, Grafana, Datadog, ELK, OTEL).
  • Proven experience leading high-severity incidents and postmortems.
  • Excellent troubleshooting, collaboration, and communication skills across engineering teams.

Nice To Haves

  • Hands-on knowledge of AIOps, continuous profiling, and advanced observability practices.
  • Prior experience mentoring engineers and shaping SRE culture at scale.
  • Familiarity with reliability automation frameworks, CI/CD pipelines, and operational tooling.

Responsibilities

  • Architecting and building frameworks, self-service tooling, and “reliability paved paths” that empower teams to own service reliability.
  • Driving automation and AI-driven operational strategies for diagnostics, remediation, and failure prevention.
  • Leading and coordinating incident response, acting as Incident Commander during high-severity events, and ensuring blameless postmortems drive lasting improvements.
  • Embedding SRE best practices in engineering workflows, including design reviews, production readiness, and operational standards.
  • Enhancing observability across systems with end-to-end monitoring, tracing, and profiling tools.
  • Mentoring engineers across product and SRE teams, sharing technical knowledge, and raising the overall reliability bar.

Benefits

  • Competitive base salary: $150,000–$225,000 USD, with final offers based on experience and expertise.
  • Potential equity participation and performance-based incentives.
  • Comprehensive healthcare including medical, dental, and vision insurance.
  • Flexible and remote-friendly work environment.
  • Generous paid time off and holidays.
  • Professional development opportunities and mentorship programs.
  • Inclusive culture that values collaboration, learning, and engineering excellence.