Site Reliability

OpenEvidence•Miami, FL

Hybrid

About The Position

As an engineer working on Site Reliability, you'll be a key architect and driver in building and hardening the mission-critical infrastructure powering our medical AI platform used by healthcare providers worldwide. This role combines exceptional technical scope with direct impact, focusing on the systemic health, performance, and efficiency of our entire production ecosystem.

You'll join our talented backend team in architecting and scaling our infrastructure, applying the SRE mindset to reduce toil, improve observability, and define robust Service Level Objectives (SLOs) across our services and data platforms. You will have significant autonomy to make architectural decisions and drive initiatives across performance optimization, infrastructure design, security, and data pipelines handling sensitive medical data at scale.

We're looking for a backend expert who thrives in a focused startup environment where technical excellence meets rapid iteration. You'll work directly with engineering leadership to translate business objectives into elegant technical solutions. The ideal candidate has a proven track record of building and scaling production systems, thinks deeply about system design, and is energized by the challenge of building healthcare infrastructure that must be both highly innovative and extremely reliable.

Requirements

B.S. or higher in computer science or a related field
4+ years of software engineering experience
Firm grasp of the SRE philosophy and mindset, with practical experience working on or directly with SRE teams that have proactively engaged in system design and improvement.
Willingness to proactively engage with development teams to influence the course of software design and operational practices.
Capability to manage risk, make decisions, and exhibit sound judgment
High proficiency operating backend services at scale
Moderate proficiency with Google Cloud or high proficiency with any public cloud
Moderate proficiency with Postgres or high proficiency with another relational DB
Experience with Django, Django REST Framework, Postgres
Motivation, drive, and ability to operate independently

Responsibilities

Design and institute automated, low-toil operational practices for system health, performance, and scalability, embracing the SRE mindset.
Engage in the end-to-end design, development, and deployment of production software, ensuring built-in reliability and performance from the start.
Own, operate, and optimize key backend services and resources (databases, caches, load balancers), driving measurable improvements in system efficiency and speed.
Lead continuous risk mitigation and incident response, and conduct blameless postmortems to enhance system resilience.
Partner with engineering and product teams to translate platform requirements into robust technical execution and contribute to product strategy.
Participate in the on-call escalation rotation (approximately one week per month, during US daylight hours).
