Senior Site Reliability Engineer II

Waystar • Atlanta, GA

About The Position

We are looking for a talented and driven Senior Site Reliability Engineer (SRE) to support our engineering team, which manages the infrastructure and services that power Waystar products. This role is ideal for an experienced engineer who thrives in data-intensive environments and is passionate about building reliable, scalable systems that ensure data integrity, availability, and performance. As a Senior SRE, you’ll work closely with engineering, product, and data teams to ensure our data licensing platforms are resilient, observable, and continuously improving.

Requirements

7+ years of experience in SRE, DevOps, or infrastructure engineering.
Strong understanding of cloud platforms (AWS, GCP, or Azure), container orchestration (Kubernetes), and infrastructure-as-code (Terraform, CloudFormation).
Experience with observability tools (e.g., Prometheus, Grafana, Splunk) and CI/CD pipelines.
Familiarity with data platforms, ETL pipelines, and distributed systems.
Excellent problem-solving and communication skills.
Experience with Python, PowerShell, or similar scripting languages.
Active use of artificial intelligence (AI) tools, techniques, and platforms to streamline workflows, enhance performance, drive innovation, and improve decision-making across business functions.
Curiosity and adaptability in exploring emerging AI technologies, with a mindset for continuous learning and experimentation.

Nice To Haves

Experience with data licensing, data governance, or data compliance frameworks.
Exposure to data pipeline tools (e.g., Apache Airflow, Kafka, Spark).
Familiarity with regulatory requirements related to data usage and distribution.

Responsibilities

Design and implement reliability solutions for data ingestion, processing, and delivery pipelines.
Define and maintain SLIs/SLOs for data licensing services and manage error budgets.
Build automation for deployment, monitoring, and incident response.
Enhance system observability through metrics, logging, and tracing.
Develop and maintain dashboards and alerts to proactively detect and resolve issues.
Participate in on-call rotations and lead incident response efforts.
Conduct root cause analysis and drive post-incident improvements.
Maintain runbooks and operational documentation.
Partner with software and data engineers to embed reliability into system design.
Contribute to blameless postmortems and reliability reviews.
Share knowledge and mentor junior team members.

Benefits

Competitive total rewards (base salary + bonus, if applicable)
Customizable benefits package (3 medical plans with Health Savings Account company match)
Generous paid time off for non-exempt team members, starting with 3 weeks + 13 paid holidays, including 2 personal floating holidays.
Flexible time off for exempt team members + 13 paid holidays
Paid parental leave (including maternity + paternity leave)
Education assistance opportunities and free LinkedIn Learning access
Free mental health and family planning programs, including adoption assistance and fertility support
401(k) program with company match
Pet insurance
Employee resource groups