Site Reliability Engineer (SRE)

Florence Healthcare - US • Atlanta, GA

About The Position

We are seeking a Site Reliability Engineer (SRE) to join one of our Scrum teams and help ensure the reliability, scalability, and performance of the Florence™ platform. AI-driven tooling and automation are a cornerstone of how we build, operate, and scale our systems. In this role, you will work closely with product engineers while actively leveraging AI to improve observability, incident response, automation, and overall platform reliability. Coding assignments in this role will require working with AI-assisted development workflows as a core part of how solutions are designed and delivered.

Requirements

Passionate about building reliable, scalable systems using modern, AI-enabled approaches
Strong understanding of cloud-native and distributed system architectures
Experience applying SRE principles in a production environment
Hands-on experience with cloud platforms (AWS preferred)
Experience using AI-assisted tools for coding, debugging, automation, or operational analysis
Strong background in Linux, networking, and system operations
Experience with infrastructure-as-code and automation tools (e.g., Terraform, CI/CD pipelines)
Familiarity with modern observability practices (metrics, logs, tracing), including AI-enhanced analysis
Comfortable working as part of an agile, cross-functional Scrum team
Strong problem-solving, communication, and collaboration skills
4+ years of experience in SRE, DevOps, or similar roles
Experience supporting production systems at scale

Responsibilities

Be an embedded member of a Scrum team, participating in planning, refinement, reviews, and retrospectives
Use AI-powered tools to enhance system reliability, operational efficiency, and developer productivity
Design, build, and operate reliable, scalable cloud infrastructure supporting platform and product services
Apply AI-assisted analysis to monitoring, alerting, and observability data to detect, predict, and prevent incidents
Define and maintain SLOs, SLIs, and error budgets to guide reliability decisions
Collaborate with software engineers to embed reliability and AI-driven automation into the software development lifecycle
Lead and participate in incident response, root cause analysis, and postmortems, leveraging AI insights where appropriate
Automate operational tasks and reduce toil through AI-enabled and traditional automation approaches
Contribute to disaster recovery planning, testing, and operational readiness
Produce and maintain documentation such as runbooks, operational guides, and system diagrams
Contribute code as a secondary responsibility, with coding assignments focused on building reliability tooling, automation, and integrations using AI-assisted development practices
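
For readers less familiar with the SLO and error-budget responsibility listed above, here is a minimal sketch of the arithmetic involved. It is illustrative only and not taken from Florence's tooling: the function names, the 30-day window, and the request-count form of the SLI are all assumptions chosen for the example.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window.

    slo_target is a fraction such as 0.999; window_days is an assumed
    rolling window, not a value stated in the posting.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent.

    The SLI here is assumed to be good_events / total_events. A result
    of 1.0 means nothing has been spent; a negative result means the
    SLO has been violated.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    print(f"Budget: {error_budget_minutes(0.999):.1f} minutes per 30 days")
    print(f"Remaining: {budget_remaining(0.999, 999_500, 1_000_000):.0%}")
```

Under these assumptions, a 99.9% availability target over 30 days allows about 43.2 minutes of downtime, and a team that has used half of its allowed failures still has half its budget to spend on releases or other risky changes.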