Principal Site Reliability Engineer - Remote

UnitedHealth Group•Minnetonka, MN

3d•Remote

About The Position

Optum Tech is a global leader in health care innovation. Our teams develop cutting-edge solutions that help people live healthier lives and help make the health system work better for everyone. From advanced data analytics and AI to cybersecurity, we use innovative approaches to solve some of health care’s most complex challenges. Your contributions here have the potential to change lives. Ready to build the next breakthrough? Join us to start Caring. Connecting. Growing together.

We are seeking a Principal Site Reliability Engineer (SRE) to define and scale reliability practices across large-scale cloud platforms. This is a senior individual contributor role focused on setting SRE standards, influencing engineering teams, and driving reliability through automation and AI-enabled operations.

This is a remote role with preference for candidates located in MN. You’ll enjoy the flexibility to work remotely from anywhere within the U.S. as you take on some tough challenges. For all hires in the Minneapolis or Washington, D.C. area, you will be required to work in the office a minimum of four days per week.

What Makes This Role Unique:
Define and influence SRE best practices across multiple platforms and teams
Drive adoption of AI-enabled reliability and operational innovation (AIOps)
Work on mission-critical healthcare systems at enterprise scale
Blend hands-on technical depth with strategic influence
Partner across engineering, platform, and security teams to elevate reliability standards

Requirements

Bachelor’s Degree in Computer Science, Information Technology, or a related field, or equivalent practical experience
10+ years of experience in Site Reliability Engineering, Software Engineering, or Cloud Engineering
Experience influencing multiple teams or platforms without direct ownership
Demonstrated experience improving reliability through automation, tooling, or AI-enabled approaches
Proven hands-on expertise in reliability engineering (SLOs, SLIs, incident management, observability)
Proven hands-on expertise in distributed systems in cloud environments (Azure preferred)
Solid understanding of system design, performance, scalability, and failure modes

Nice To Haves

Experience implementing AI/ML or AIOps solutions in production environments (e.g., anomaly detection, alert optimization, automation)
Experience standardizing observability frameworks (e.g., OpenTelemetry or similar)
Experience working in complex enterprise or regulated environments
Background supporting large-scale, mission-critical systems
Proven ability to influence senior technical stakeholders