Site Reliability Engineer

SafeRide Health

About The Position

SafeRide Health is seeking a Site Reliability Engineer to develop and implement processes that support software delivery excellence and operational discipline, so that user-facing services and production systems remain highly available, reliable, and scalable. Key responsibilities include defining and monitoring Service Level Objectives (SLOs), responding to and resolving incidents, developing automation for operational tasks, performing capacity planning, and collaborating with development teams to mitigate operational risks and improve system design.

Requirements

Minimum of 5 years of progressive experience in an IT, Software Engineering, Technology Operations, or Business Continuity role.
Minimum of 2 years of hands-on experience in a Site Reliability, DevOps, or IT Observability role.
Demonstrated proficiency with production monitoring and alerting tools (Datadog is a major plus!).
Basic proficiency working in a containerized AWS environment managed with infrastructure as code.

Nice To Haves

Expertise in major cloud platforms such as AWS and Azure.
Deep knowledge of operating systems, networking, storage, and distributed systems.
Experience with tools for infrastructure as code (e.g., Terraform), containerization (e.g., Docker), and APM/monitoring (e.g., Prometheus, Datadog, New Relic, Grafana, Splunk).
Proficiency in programming languages such as Python, Ruby, and JavaScript for developing automation and managing infrastructure.
Strong communication and collaboration skills to work effectively with development, operations, and other cross-functional teams.

Responsibilities

Keeping systems and services running smoothly with minimal downtime by focusing on availability, reliability, and scalability.
Developing and maintaining tools and scripts to automate repetitive tasks such as deployments, configuration management, and monitoring.
Implementing and managing monitoring and alerting systems to provide visibility into system performance and quickly detect potential issues.
Responding to, diagnosing, and resolving system incidents, including conducting post-mortems to prevent future occurrences.
Monitoring system resource usage to forecast future needs and scale systems accordingly to handle increasing user load.
Collaborating with stakeholders to identify operational risks and implementing strategies to reduce their likelihood and impact.
Analyzing metrics from operating systems and applications to identify areas for performance improvement.

Benefits

Competitive compensation and performance-based bonus potential
Full medical, dental, and vision coverage
Generous PTO and paid company holidays
401(k) with employer match
Paid parental leave and family support benefits

What This Job Offers

Job Type

Full-time

Career Level

Mid Level

Number of Employees

251-500 employees
