Senior Software Engineer, Site Reliability Engineering

Ridgeline•Reno, NV

$153,000 - $210,000•Hybrid

About The Position

As a Site Reliability Engineer, you'll help ensure the reliability, scalability, and operational excellence of Ridgeline's mission-critical SaaS platform. You'll partner closely with product and platform engineers to improve service reliability, accelerate engineering velocity through automation, and build systems that are easier to operate from day one. Our team of engineers is building with cutting-edge technologies, such as Claude Code and Cursor, in a fast-moving, creative, progressive work environment. You'll play a key role in advancing our observability, release engineering, incident response, and automation capabilities while contributing measurable improvements to platform stability and developer productivity.

At Ridgeline, how we work matters as much as what we build. Ridgeliners act like owners, choose growth over comfort, and communicate with transparency. We assume positive intent, bias toward action, and bring solutions, not just problems. We celebrate wins, learn from setbacks, and thrive in a resilient, collaborative, high-performing culture. If this excites you, we'd love to meet you!

You must be authorized to work in the United States without the need for employer sponsorship.

Requirements

3–6 years of experience in Site Reliability Engineering, DevOps, Platform Engineering, or a related discipline.
At least 2 years supporting mission-critical production SaaS workloads running on AWS.
Experience operating production systems where uptime, performance, and reliability are business critical.
Hands-on experience with AWS services including EC2, ECS or EKS, RDS, S3, IAM, CloudWatch, and managed database or messaging services.
Strong understanding of observability, including monitoring, alerting, distributed tracing, and production diagnostics.
Experience designing or significantly improving CI/CD pipelines using tools such as GitHub Actions, CircleCI, Buildkite, or similar platforms.
Experience with deployment strategies including blue/green, canary, or progressive rollouts.
Proficiency in Python, Go, Bash, or another scripting language used for automation and tooling.
Experience implementing Infrastructure as Code using Terraform.
Comfortable participating in an on-call rotation and leading incident response with composure.
Excellent communication skills with the ability to explain technical concepts to both technical and non-technical stakeholders.
Demonstrated ability to make measurable improvements to platform reliability, operational efficiency, or developer productivity.
Strong analytical and troubleshooting skills with a passion for solving complex technical challenges.
A collaborative mindset with a desire to learn, mentor others, and contribute to a positive engineering culture.

Nice To Haves

Experience with Kubernetes and Helm.
Familiarity with chaos engineering or fault injection practices.
Experience building or contributing to SLO and error budget programs.
Working knowledge of Kotlin, Node.js, or TypeScript.
Experience supporting highly distributed cloud-native applications.
Bachelor's degree in Computer Science, Information Systems, or a related technical discipline.

Responsibilities

Improve the reliability, availability, and performance of Ridgeline's mission-critical production SaaS platform.
Build automation that measurably increases engineering velocity while reducing operational toil.
Own and improve production observability through metrics, structured logging, distributed tracing, dashboards, and actionable alerting.
Design and enhance CI/CD pipelines, deployment automation, progressive delivery strategies, and rollback mechanisms.
Define and improve Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budget practices to proactively manage reliability.
Identify capacity constraints and reliability risks before they impact customers.
Participate in an on-call rotation: triage production issues, coordinate incident response, and drive issues to resolution. After-hours support is very infrequent.
Lead blameless postmortems and implement long-term improvements that strengthen platform resilience.
Partner with software engineers on infrastructure design reviews to build highly operable, scalable services.
Develop Infrastructure as Code solutions using Terraform and AWS best practices.
Collaborate across a distributed engineering organization while fostering a culture of ownership, transparency, learning, and continuous improvement.