Sr Specialist Site Reliability Engineer

Waystar • Atlanta, GA

About The Position

We are seeking a highly skilled and proactive Senior Specialist, Site Reliability Engineering (SRE) to help drive reliability, scalability, and performance across our critical platforms. This role is ideal for a senior-level engineer who combines deep technical expertise with a passion for automation, observability, and operational excellence. As a Senior Specialist, you'll work on complex reliability challenges, lead technical initiatives, and collaborate across engineering, product, and infrastructure teams to ensure our systems are resilient and efficient.

Requirements

7+ years of experience in SRE, DevOps, or infrastructure engineering.
Deep expertise in cloud platforms (AWS, GCP, or Azure), container orchestration (Kubernetes), and infrastructure-as-code (Terraform, CloudFormation).
Strong proficiency in observability tools (e.g., Prometheus, Grafana, Splunk) and CI/CD pipelines.
Proven track record of solving complex reliability challenges in distributed systems.
Excellent communication and collaboration skills.
Experience with Python, PowerShell, or similar languages.
Active use of artificial intelligence (AI) tools and techniques to enhance performance, streamline workflows, improve decision-making, and drive innovation across business functions.
Curiosity and adaptability in exploring emerging AI technologies, with a mindset for continuous learning and experimentation.

Nice To Haves

Experience in regulated or high-availability environments (e.g., financial services, healthcare).
Familiarity with chaos engineering, performance tuning, and capacity planning.
Background in software development with strong coding skills (e.g., Python, Go, Bash).

Responsibilities

Reliability Engineering:
Architect and implement solutions to improve system reliability, scalability, and performance.
Define and manage SLIs/SLOs and error budgets across services.
Lead efforts to automate operational tasks and improve system observability.

Incident Management & Root Cause Analysis:
Serve as a technical lead during major incidents and drive resolution.
Conduct deep root cause analyses and implement long-term fixes.
Champion blameless postmortems and continuous improvement.

Technical Leadership:
Lead cross-functional reliability initiatives and mentor junior engineers.
Influence system design and architecture to embed reliability from the ground up.
Collaborate with software engineers to optimize deployment pipelines and infrastructure.

Monitoring & Tooling:
Enhance observability through metrics, logging, and tracing.
Develop and maintain dashboards, alerts, and automated recovery systems.

Benefits

Competitive total rewards (base salary + bonus, if applicable)
Customizable benefits package (3 medical plans with Health Savings Account company match)
Generous paid time off for non-exempt team members, starting at 3 weeks plus 13 paid holidays (including 2 personal floating holidays); flexible time off plus 13 paid holidays for exempt team members
Paid parental leave (including maternity + paternity leave)
Education assistance opportunities and free LinkedIn Learning access
Free mental health and family planning programs, including adoption assistance and fertility support
401(k) program with company match
Pet insurance
Employee resource groups
