Site Reliability Engineer

UnitedHealth Group•Eden Prairie, MN

7d•Remote

About The Position

Optum is a global organization that delivers care, aided by technology, to help millions of people live healthier lives. The work you do with our team will directly improve health outcomes by connecting people with the care, pharmacy benefits, data and resources they need to feel their best. Here, you will find a culture guided by diversity and inclusion, talented peers, comprehensive benefits and career development opportunities. Come make an impact on the communities we serve as you help us advance health equity on a global scale. Join us to start Caring. Connecting. Growing together.

You’ll enjoy the flexibility to work remotely from anywhere within the U.S. as you take on some tough challenges. For all hires in the Minneapolis or Washington, D.C. area, you will be required to work in the office for a minimum of four days per week.

You’ll be rewarded and recognized for your performance in an environment that will challenge you and give you clear direction on what it takes to succeed in your role, as well as provide development for other roles you may be interested in.

Requirements

High School Diploma/GED (or higher)
3+ years of experience with Google Cloud Platform, or with another major cloud provider and a willingness to ramp up quickly on GCP
3+ years of experience building and maintaining CI/CD pipelines, preferably using GitHub Actions
2+ years of experience troubleshooting distributed systems and working with observability platforms such as Prometheus, Grafana, Datadog, or equivalent
Intermediate-level knowledge of Kubernetes, including deploying, scaling, and operating containerized applications
Intermediate level of experience working with Terraform or similar infrastructure-as-code toolsets
Intermediate level of experience in Python for automation, scripting, and tooling

Nice To Haves

Experience with Kafka or other distributed streaming platforms (e.g., Pulsar, Kinesis, Pub/Sub)
Familiarity with Helm for Kubernetes package management
Exposure to cloud security best practices and system hardening
Experience optimizing distributed systems and microservices architectures
Working knowledge of Java to support troubleshooting backend services
Familiarity with DataHub or other metadata management platforms
Exposure to AI/ML tooling, platforms, or MLOps workflows
Golang experience for building cloud-native tools

Responsibilities

Design and operate Kubernetes-based infrastructure with moderate independence to support reliable and scalable applications
Build and improve CI/CD pipelines in GitHub Actions to reduce manual steps and increase deployment reliability
Use Terraform to provision, update, and manage GCP infrastructure using best practices
Manage Kafka clusters and pipelines or equivalent streaming systems (e.g., Pulsar, Pub/Sub, Kinesis), including performance tuning and troubleshooting
Develop dashboards, alerts, and reliability improvements using Prometheus and Grafana
Partner with development teams to automate workflows and enhance IaC standards
Write Python automation tools that improve operational efficiency
Troubleshoot distributed system issues and participate in root-cause analysis
Fully participate in on-call rotations and lead smaller-scale incident responses
AI Builder: Design, develop, and deploy AI-powered solutions using no-code, low-code, and advanced platforms, translating business needs into scalable applications that enhance products, workflows and decision-making