Senior Site Reliability Engineer

Dealer Tire
$110,000 - $125,000

About The Position

We’re Dealer Tire, a family-owned, international distributor of tires and parts established in 1918 in Cleveland, OH. We’re laser-focused on helping the world’s largest and most trusted auto manufacturers grow their tire business. In fact, we’ve sold more than 60 million tires to date. We’re a thriving company, and we’re looking for driven individuals to join our team. That’s where you come in!

As a Senior Site Reliability Engineer, you will be a hands-on technical individual contributor embedded within the Core Systems team, responsible for the daily health, stability, and performance of our production environment. You will serve as a primary responder for production incidents, owning triage through resolution, including root cause analysis, infrastructure remediation, and order automation recovery. You will work directly alongside the Manager, Consumer Technology Site Reliability, and Helpdesk to handle day-to-day triage and fix responsibilities, enabling leadership to focus on strategic decisions and team direction. You will also partner with development teams to evaluate production risk before deployment. The requirements and essential job responsibilities for the Senior Site Reliability Engineer - Core Systems role are listed below.

Requirements

  • 5+ years in a Site Reliability Engineering, DevOps, or Production Support role at a software or e-commerce company.
  • Demonstrated ability to independently diagnose and resolve production incidents, including infrastructure-level failures (servers, queues, batch jobs, APIs).
  • Hands-on experience with AWS (EC2, CloudWatch, or equivalent) for day-to-day operational tasks.
  • Experience with Datadog, New Relic, PagerDuty, or equivalent platforms for monitoring, alerting, and incident detection.
  • Working knowledge of MySQL/relational databases for investigative queries and data validation.
  • Ability to read and analyze complex SQL queries to diagnose production data issues.
  • Familiarity with PHP, Python, Bash, or similar languages sufficient to read, debug, and modify production scripts and automation jobs.
  • Experience with Rundeck, cron, or equivalent batch job management and monitoring tools.

Competencies

  • Problem-Solving
  • Composure
  • Accountability
  • Detail-Oriented
  • Adaptability
  • Collaborative
  • Proactive Communication
  • Results Orientation

Working Conditions

  • Continuously viewing and entering data on a computer screen
  • Speaking through the computer in frequent meetings and one-to-one conversations
  • Sitting for long periods of time
  • Travel required (<10%)
  • Dealer Tire is a drug-free environment. All applicants being considered for employment must pass a pre-employment drug screening before beginning work.

Responsibilities

  • Production Triage: Triage all incidents surfaced via the #triage Slack channels, Datadog alerts, Rundeck failures, contact center reports, and proactive monitoring across all business units.
  • Incident Ownership: Serve as the primary on-call responder for production incidents. Acknowledge, investigate, and drive issues to resolution with clear communication throughout the incident lifecycle.
  • Root Cause Analysis: Lead root cause analysis (RCA) for production failures, including order automation breakdowns, Gearman/worker queue degradation, API integration outages, batch job timeouts, and database performance events. Document findings with sufficient detail to support post-mortem review.
  • Hands-On Remediation: Execute infrastructure-level remediation, including EC2 instance restarts, Gearman worker pool resets, Rundeck job recovery, order status resets, and inventory and pricing queue restoration.
  • Regression Identification: Identify deployment-related regressions by correlating incident timelines to recent deployments. Initiate and coordinate revert requests with development teams when causal links are established.
  • Incident Coordination: Direct cross-functional teams during active incidents — assigning investigation tasks, managing parallel workstreams, tracking affected order or customer counts, and keeping all stakeholders informed via Slack threads and JIRA ticket updates.
  • Monitor the entire Consumer Enterprise Group (CEG) Platform processing environment and proactively surface anomalies, enhancement opportunities, and risk areas to leadership.
  • Assist with data cleanup and order recovery operations following production incidents.
  • Support testing and validation of infrastructure changes prior to production deployment.
  • Ensure accurate and timely entry of incident details, findings, and resolutions into JIRA tracking systems.
  • Continue to develop expertise in the CEG codebase, third-party integrations, and operational tooling through working sessions and self-directed learning.
  • Pursue training and certification opportunities that support personal growth and enhance effectiveness in the role.
  • Other duties as assigned.

Benefits

  • Paid time off
  • Medical
  • Dental
  • Vision
  • 401(k) match (50% on the dollar, up to 7% of employee contribution)