Senior Site Reliability Engineer

O'Reilly Auto Parts • Headquarters, KY

About The Position

Site Reliability Engineers are responsible for ensuring the availability, reliability, scalability, and performance of the firm’s most critical customer-facing microservices that power all eCommerce channels. This role applies Google-inspired SRE principles to balance feature velocity and system reliability using Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets. The role combines software engineering, cloud engineering, automation, and production operations, with a strong emphasis on building systems that are observable, resilient, and operable by default.

Requirements

4+ years of experience in SRE, software engineering, or production operations supporting large-scale eCommerce platforms.
Hands-on experience with Java/J2EE-based distributed systems; React experience is a plus.
Proven ability to design and operate systems using SLO-driven reliability models.
Experience defining and measuring SLIs, including availability, latency, error rates, throughput, and saturation.
Good understanding of NoSQL technologies and RDBMS concepts, with the ability to write and troubleshoot database queries.
Experience deploying and operating services on cloud platforms such as AWS, Azure, or Google Cloud Platform (GCP).
Expertise with observability, APM, CDN/caching, and digital experience analytics tools such as Dynatrace, Splunk, ELK, Akamai, Quantum Metric, and Tealeaf.
Strong experience using Jira for backlog management, incident tracking, toil reduction initiatives, and cross-team coordination.
Ability to independently own services and drive reliability initiatives end-to-end.
Strong communication skills with the ability to influence engineering and product teams.
Experience participating in on-call rotations and handling critical/high-severity incidents.

Nice To Haves

Experience building and operating microservices architectures using Spring Boot, Groovy, React, or similar technologies.
Strong understanding of CI/CD pipelines, release automation, and progressive delivery practices.
Experience working within eCommerce domains such as Catalog, Customer Data, and Order Management.
Familiarity with search platforms including Endeca, Solr, Lucene, and Elasticsearch.
Proficiency in scripting and automation using Python, Bash, Ruby, Perl, or PowerShell.
Experience with ITSM tools integrated with Jira workflows.
Exposure to capacity planning, load testing, and chaos engineering practices.
Experience with containerization and orchestration technologies such as Docker and Kubernetes (EKS, AKS, or GKE).
Familiarity with Infrastructure as Code (IaC) tools such as Terraform, CloudFormation, or Ansible.
Understanding of operational KPIs including availability, MTTR, MTTD, deployment success rate, and incident recurrence metrics.
Experience conducting production readiness reviews and implementing operational governance processes.
Ability to collaborate with security and platform engineering teams to ensure reliability, compliance, and operational security best practices.
Exposure to AI-assisted operations, anomaly detection, intelligent alerting, and automated remediation solutions.
Experience designing scalable, self-healing platforms and automation frameworks for cloud-native environments.

Responsibilities

Define, implement, and own SLIs, SLOs, and error budgets for critical microservices in collaboration with product and engineering teams.
Use error budgets to influence release decisions, prioritize reliability initiatives, and manage operational risk.
Design and maintain observability platforms, including metrics, logs, traces, and real-time telemetry.
Track, manage, and reduce operational toil by converting repetitive operational tasks into Jira stories and epics with clear ownership and measurable outcomes.
Design, implement, and validate resiliency mechanisms such as graceful degradation, redundancy, automated failover, and disaster recovery.
Lead incident response efforts, act as an escalation point for high-severity incidents, and drive blameless postmortems.
Capture incident action items and reliability improvements in Jira, ensuring accountability, closure, and continuous improvement.
Partner with Scrum teams to improve reliability through release readiness reviews, production change validation, and testing strategies.
Perform deep root cause analysis, debugging, and performance tuning across distributed systems.
Promote shift-left reliability practices by embedding operability, monitoring, and failure testing early in the SDLC.
Drive continuous improvement through automation, self-healing systems, chaos engineering, and capacity planning.
Maintain runbooks, playbooks, and knowledge repositories, linking documentation to Jira tasks to reduce MTTR.
Provide technical leadership and mentoring to junior SREs and engineers.
Collaborate with global, distributed teams, leveraging Jira for transparent planning, dependency tracking, and execution.
Conduct production readiness reviews and ensure services meet operational excellence standards before deployment.
Track and improve operational KPIs such as availability, MTTR, MTTD, deployment success rate, and incident recurrence.
Collaborate with security and platform teams to ensure reliability, compliance, and operational security best practices are embedded into systems and deployment pipelines.
Explore opportunities to leverage AI-driven observability, anomaly detection, and operational automation to improve system reliability and reduce manual effort.

Benefits

Competitive Wages & Paid Time Off
Stock Purchase Plan & 401k with Employer Contributions Starting Day One
Medical, Dental, & Vision Insurance with Optional Flexible Spending Account (FSA)
Team Member Health/Wellbeing Programs
Tuition & Educational Assistance Programs
Opportunities for Career Growth
