Senior Site Reliability Engineer

UnitedHealth Group
Eden Prairie, MN
$91,700 - $163,700
Remote

About The Position

The Site Reliability Engineering (SRE) team at Optum Financial ensures world-class reliability, scalability, security, compliance, and performance of the infrastructure platform that powers diverse financial products. We exist so our customers, partners, and engineers can trust our financial products and innovate on them quickly and without fear. As a Senior SRE, you will lead our mission to own the tools, platforms, and processes that enable success. Our team is driving modern observability practices with OpenTelemetry and the adoption of SLOs as reliability measures. You will be instrumental in automating our environment and building AI-enhanced platforms to support the next generation of financial technology. You will enjoy the flexibility to telecommute from anywhere within the U.S. as you take on some tough challenges.

Requirements

  • 5+ years of experience in software engineering, DevOps, or Site Reliability Engineering (SRE) roles
  • 2+ years of experience implementing and supporting observability and monitoring tools (e.g., OpenTelemetry, Datadog, Splunk, Dynatrace)
  • 2+ years of experience defining and maintaining SLIs, SLOs, and production alerting strategies
  • 2+ years of experience working in cloud environments (Azure or AWS)
  • 1+ years of experience supporting containerized applications (e.g., Kubernetes, Docker)

Nice To Haves

  • Bachelor’s degree in Computer Science, Information Technology, or a related field
  • 2+ years of experience with CI/CD tools (e.g., Jenkins, GitHub Actions, ArgoCD)
  • 1+ years of experience with infrastructure as code tools (e.g., Terraform, Pulumi)
  • 1+ years of experience participating in incident response and root cause analysis (RCA) processes
  • Direct experience developing automation for operational workflows or reliability engineering tasks
  • Exposure to AI/ML concepts or practical experience applying automation to improve operational efficiency

Responsibilities

  • Design, develop, and deploy AI-powered solutions to address complex infrastructure and reliability challenges with an emphasis on the responsible use of AI
  • Implement and support observability and monitoring solutions using tools such as OpenTelemetry, Datadog, Splunk, and Dynatrace to improve system visibility and reliability
  • Define, implement, and maintain service level indicators (SLIs), service level objectives (SLOs), and actionable alerting strategies in partnership with engineering teams
  • Use and evaluate enterprise-approved AI tools to streamline workflows, automate tasks, and drive continuous improvement across the platform
  • Develop and maintain automation to improve operational efficiency, including alerting, incident analysis, and recovery workflows
  • Support incident response processes, including troubleshooting, root cause analysis (RCA), and implementation of corrective actions to prevent recurrence
  • Support cloud-based infrastructure (Azure or AWS) and containerized environments (Kubernetes, Docker) to enhance scalability, stability, and efficiency
  • Evaluate emerging technology trends to inform solution design and strategic innovation for the SRE platform
  • Contribute to the development of SRE platform capabilities, including self-healing systems and automated operational processes
  • Partner with cross-functional teams to promote adoption of SRE best practices and improve overall system reliability

Benefits

  • Comprehensive benefits package
  • Incentive and recognition programs
  • Equity stock purchase
  • 401(k) contribution