Principal Site Reliability Engineer - Remote

UnitedHealth Group
Minnetonka, MN
Remote

About The Position

Optum Tech is a global leader in health care innovation. Our teams develop cutting-edge solutions that help people live healthier lives and help make the health system work better for everyone. From advanced data analytics and AI to cybersecurity, we use innovative approaches to solve some of health care’s most complex challenges. Your contributions here have the potential to change lives. Ready to build the next breakthrough? Join us to start Caring. Connecting. Growing together.

We are seeking a Principal Site Reliability Engineer (SRE) to define and scale reliability practices across large-scale cloud platforms. This is a senior individual contributor role focused on setting SRE standards, influencing engineering teams, and driving reliability through automation and AI-enabled operations.

This is a remote role with preference for candidates located in MN. You’ll enjoy the flexibility to work remotely from anywhere within the U.S. as you take on some tough challenges. For all hires in the Minneapolis or Washington, D.C. area, you will be required to work in the office a minimum of four days per week.

What Makes This Role Unique

  • Define and influence SRE best practices across multiple platforms and teams
  • Drive adoption of AI-enabled reliability and operational innovation (AIOps)
  • Work on mission-critical healthcare systems at enterprise scale
  • Blend hands-on technical depth with strategic influence
  • Partner across engineering, platform, and security teams to elevate reliability standards

Requirements

  • Bachelor’s Degree in Computer Science, Information Technology, or a related field, or equivalent practical experience
  • 10+ years of experience in Site Reliability Engineering, Software Engineering, or Cloud Engineering
  • Experience influencing multiple teams or platforms without direct ownership
  • Demonstrated experience improving reliability through automation, tooling, or AI-enabled approaches
  • Proven hands-on expertise in: Reliability engineering (SLOs, SLIs, incident management, observability)
  • Proven hands-on expertise in: Distributed systems in cloud environments (Azure preferred)
  • Solid understanding of system design, performance, scalability, and failure modes

Nice To Haves

  • Experience implementing AI/ML or AIOps solutions in production environments (e.g., anomaly detection, alert optimization, automation)
  • Experience standardizing observability frameworks (e.g., OpenTelemetry or similar)
  • Experience working in complex enterprise or regulated environments
  • Background supporting large-scale, mission-critical systems
  • Proven ability to influence senior technical stakeholders

Responsibilities

  • Define and drive SRE standards across teams
  • Lead implementation of: SLOs, SLIs, error budgets
  • Lead implementation of: Observability (metrics, logs, tracing)
  • Lead implementation of: Resiliency patterns (failover, self-healing)
  • Improve reliability through automation and proactive risk mitigation
  • Drive reliability practices in Azure environments
  • Apply AIOps (anomaly detection, intelligent alerting, automation)
  • Influence engineering teams without direct authority

Benefits

  • Comprehensive benefits package
  • Incentive and recognition programs
  • Equity stock purchase
  • 401(k) contribution