Sr. Observability Engineer

UnitedHealth Group • La Crosse, WI

About The Position

Optum is a global organization that delivers care, aided by technology, to help millions of people live healthier lives. The work you do with our team will directly improve health outcomes by connecting people with the care, pharmacy benefits, data and resources they need to feel their best. Here, you will find talented peers, comprehensive benefits and career development opportunities. Come make an impact on the communities we serve as you help us advance health optimization on a global scale. Join us to start Caring. Connecting. Growing together.

The OptumServe Enterprise Monitoring team is looking for an Observability Engineer. The team is responsible for enterprise infrastructure, application, and network monitoring across on-prem, hybrid, and various cloud environments. The selected candidate will join a team of skilled engineers with a broad background in enterprise monitoring and observability. The role focuses on maintaining the reliability, scalability and availability of our log management solution as well as our metrics and observability platform, which relies heavily on automation (Terraform, Ansible and scripts). It also requires maintaining the performance KPIs of our solutions and defining their SLOs.
You'll be rewarded and recognized for your performance in an environment that will challenge you and give you clear direction on what it takes to succeed in your role, as well as provide development for other roles you may be interested in.

Requirements

2+ years of experience working directly with monitoring tools as an admin, SME, or architect, preferably with Dynatrace and/or ELK
2+ years of experience with Dynatrace (managed, cloud, and offline deployments, including best practices and setup for ActiveGate, cloud, on-prem, and custom configurations with workflows), or with Elastic on-prem and in the cloud, including platform best practices
1+ years of experience designing data pipelines using Filebeat, Logstash, and/or Fluent Bit/Fluentd
1+ years of experience applying AI to observability to reduce manual work and make our products more reliable and resilient
1+ years of experience writing scripts in Python and either Bash or PowerShell to automate tasks
1+ years of experience working with Linux

Nice To Haves

BS/MS in CS/engineering or equivalent, OR 5+ years of experience
1+ years of experience with Terraform and Ansible, including syntax, best practices, and managing complex configurations across multiple commercial and government clouds to build and manage infrastructure and applications
1+ years of scripting experience (JavaScript, Java, PowerShell, or others)
SNMP, tcpdump, and tracing
Knowledge of AIOps platforms

Responsibilities

Maintain and deploy monitoring and alerting
Design, configure, and maintain a large-scale log aggregation solution
Set up and manage ingestion pipelines and data transformations
Have the mindset of "automate any task"
Build and maintain robust monitoring systems using tools like ELK, Dynatrace, Prometheus, OpenTelemetry (OTel), and Grafana to detect potential issues early and trigger alerts for timely response
Maintain associated documentation as it applies to our audit and certification requirements
Participate in troubleshooting, capacity planning, and performance analysis activities
Research new monitoring requirements and, in many cases, write code to meet them
Intermediate to expert proficiency in setting up AI rules for tools like Davis AI (Dynatrace) and/or Elastic GenAI
Solid expertise in setting up monitoring policies, rules, and templates, and in writing scripts to accomplish monitoring requirements