Senior Site Reliability Engineer

UnitedHealth Group•Eden Prairie, MN

$91,700 - $163,700•Remote

About The Position

The Site Reliability Engineering (SRE) team at Optum Financial ensures world-class reliability, scalability, security, compliance, and performance of the infrastructure platform that powers diverse financial products. We exist so our customers, partners, and engineers can trust our financial products and innovate on them quickly and without fear. As a Senior SRE, you will lead our mission to own the tools, platforms, and processes that enable that success. Our team is driving modern observability practices with OpenTelemetry and the adoption of SLOs as reliability measures. You will be instrumental in automating our environment and building AI-enhanced platforms to support the next generation of financial technology. You will enjoy the flexibility to telecommute from anywhere within the U.S. as you take on some tough challenges.

Requirements

5+ years of experience in software engineering, DevOps, or Site Reliability Engineering (SRE) roles
2+ years of experience implementing and supporting observability and monitoring tools (e.g., OpenTelemetry, Datadog, Splunk, Dynatrace)
2+ years of experience defining and maintaining SLIs, SLOs, and production alerting strategies
2+ years of experience working in cloud environments (Azure or AWS)
1+ years of experience supporting containerized applications (e.g., Kubernetes, Docker)

Nice To Haves

Bachelor’s degree in Computer Science, Information Technology, or a related field
2+ years of experience with CI/CD tools (e.g., Jenkins, GitHub Actions, ArgoCD)
1+ years of experience with infrastructure as code tools (e.g., Terraform, Pulumi)
1+ years of experience participating in incident response and root cause analysis (RCA) processes
Direct experience developing automation for operational workflows or reliability engineering tasks
Exposure to AI/ML concepts or practical experience applying automation to improve operational efficiency

Responsibilities

Design, develop, and deploy AI-powered solutions to address complex infrastructure and reliability challenges with an emphasis on the responsible use of AI
Implement and support observability and monitoring solutions using tools such as OpenTelemetry, Datadog, Splunk, and Dynatrace to improve system visibility and reliability
Define, implement, and maintain service level indicators (SLIs), service level objectives (SLOs), and actionable alerting strategies in partnership with engineering teams
Use and evaluate enterprise-approved AI tools to streamline workflows, automate tasks, and drive continuous improvement across the platform
Develop and maintain automation to improve operational efficiency, including alerting, incident analysis, and recovery workflows
Support incident response processes, including troubleshooting, root cause analysis (RCA), and implementation of corrective actions to prevent recurrence
Support cloud-based infrastructure (Azure or AWS) and containerized environments (Kubernetes, Docker) to enhance scalability, stability, and efficiency
Evaluate emerging technology trends to inform solution design and strategic innovation for the SRE platform
Contribute to the development of SRE platform capabilities, including self-healing systems and automated operational processes
Partner with cross-functional teams to promote adoption of SRE best practices and improve overall system reliability