Site Reliability Engineer

UnitedHealth Group
Eden Prairie, MN
7d
Remote

About The Position

Optum is a global organization that delivers care, aided by technology, to help millions of people live healthier lives. The work you do with our team will directly improve health outcomes by connecting people with the care, pharmacy benefits, data and resources they need to feel their best. Here, you will find a culture guided by diversity and inclusion, talented peers, comprehensive benefits and career development opportunities. Come make an impact on the communities we serve as you help us advance health equity on a global scale. Join us to start Caring. Connecting. Growing together.

You’ll enjoy the flexibility to work remotely from anywhere within the U.S. as you take on some tough challenges. For all hires in the Minneapolis or Washington, D.C. area, you will be required to work in the office for a minimum of four days per week.

You’ll be rewarded and recognized for your performance in an environment that will challenge you and give you clear direction on what it takes to succeed in your role, as well as provide development for other roles you may be interested in.

Requirements

  • High School Diploma/GED (or higher)
  • 3+ years of experience with Google Cloud Platform, or with another major cloud provider and a willingness to ramp up quickly on GCP
  • 3+ years of experience building and maintaining CI/CD pipelines, preferably using GitHub Actions
  • 2+ years of experience troubleshooting distributed systems and working with observability platforms such as Prometheus, Grafana, Datadog, or equivalent
  • Intermediate knowledge of Kubernetes, including deploying, scaling, and operating containerized applications
  • Intermediate experience with Terraform or similar infrastructure-as-code toolsets
  • Intermediate experience with Python for automation, scripting, and tooling

Nice To Haves

  • Experience with Kafka or other distributed streaming platforms (e.g., Pulsar, Kinesis, Pub/Sub)
  • Familiarity with Helm for Kubernetes package management
  • Exposure to cloud security best practices and system hardening
  • Experience optimizing distributed systems and microservices architectures
  • Working knowledge of Java to support troubleshooting backend services
  • Familiarity with DataHub or other metadata management platforms
  • Exposure to AI/ML tooling, platforms, or MLOps workflows
  • Golang experience for building cloud-native tools

Responsibilities

  • Design and operate Kubernetes-based infrastructure with moderate independence to support reliable and scalable applications
  • Build and improve CI/CD pipelines in GitHub Actions to reduce manual steps and increase deployment reliability
  • Use Terraform to provision, update, and manage GCP infrastructure using best practices
  • Manage Kafka clusters and pipelines or equivalent streaming systems (e.g., Pulsar, Pub/Sub, Kinesis), including performance tuning and troubleshooting
  • Develop dashboards, alerts, and reliability improvements using Prometheus and Grafana
  • Partner with development teams to automate workflows and enhance IaC standards
  • Write Python automation tools that improve operational efficiency
  • Troubleshoot distributed system issues and participate in root-cause analysis
  • Fully participate in on-call rotations and lead smaller-scale incident responses
  • AI Builder: Design, develop, and deploy AI-powered solutions using no-code, low-code, and advanced platforms, translating business needs into scalable applications that enhance products, workflows and decision-making

Benefits

  • Comprehensive benefits package
  • Incentive and recognition programs
  • Equity stock purchase
  • 401(k) contribution