Senior Site Reliability Engineer (Hybrid)

Rewards Network • Chicago, IL

About The Position

The Site Reliability Engineer supports the deployments, cloud infrastructure, and monitoring systems that power Rewards Network's applications and services. This role exists to ensure reliable, secure, and scalable operations across our Kubernetes clusters, AWS environments, and observability platforms. We are hiring an experienced Site Reliability Engineer to support our engineering teams with deployments, troubleshooting, and infrastructure improvements. You'll join a small, senior SRE team with broad ownership of the platforms and infrastructure that Rewards Network runs on. This position is well suited to someone with strong hands-on experience who can get up to speed quickly and begin making meaningful contributions. We’re open to hiring at the mid to senior level based on experience. This is a hybrid position that requires in-office presence 3 days a week (Tuesday-Thursday) in Chicago.

Requirements

Kubernetes administration and troubleshooting.
Infrastructure as code using Terraform or similar tools (we use Terraform with Atlantis).
AWS services (EC2, S3, IAM, RDS, etc.).
Monitoring and observability tools (Grafana, Prometheus, Elasticsearch).
Secrets management with HashiCorp Vault or similar tools.
Linux system administration and Docker containerization.
Proficiency in at least one non-shell programming language for building tooling and automation (we use Go).
CI/CD pipeline management and deployment automation (we use GitLab CI and TeamCity).
Familiarity with Kafka and Logstash.
Experience with incident response and operational support best practices.
Ability to balance ad hoc support requests with project priorities.
Strong communication skills to work effectively across technical and non-technical teams.
Bachelor's degree in Computer Science, Engineering, or equivalent practical experience.
5+ years of experience in Site Reliability Engineering, DevOps, or related infrastructure roles.

Responsibilities

Support and improve deployment pipelines to production and staging environments, with a focus on reliability, consistency, and reducing toil.
Troubleshoot and resolve issues across Kubernetes clusters, Docker containers, Linux-based environments, and the applications and services that run on them.
Use Grafana, Prometheus, and Elasticsearch to monitor, diagnose, and improve system health.
Build and maintain internal tooling and automation to improve the developer experience.
Partner with development teams to address infrastructure and deployment needs, both planned and ad hoc.
Maintain and improve AWS infrastructure using Terraform and Atlantis.
Manage secrets and security operations with HashiCorp Vault.
Participate in an on-call rotation to support production systems and incident response.
Collaborate across teams to improve system observability, resilience, and automation.
Document processes and contribute to knowledge sharing to improve engineering efficiency.