AWS Cloud Site Reliability Engineer

UnitedHealth Group • Basking Ridge, NJ

10d • $72,800 - $130,000 • Remote

About The Position

Optum is a global organization that delivers care, aided by technology, to help millions of people live healthier lives. The work you do with our team will directly improve health outcomes by connecting people with the care, pharmacy benefits, data, and resources they need to feel their best. Here, you will find a culture guided by diversity and inclusion, talented peers, comprehensive benefits, and career development opportunities. Come make an impact on the communities we serve as you help us advance health equity on a global scale. Join us to start Caring. Connecting. Growing together.

You will be part of a world-class identity matching solution, building state-of-the-art applications at the center of identity management for Optum Technology. You will have a true opportunity to change the healthcare landscape for the better.

The role requires providing 24x7 operational support for all production practices, including holidays and weekends. You will coordinate with various teams and raise support tickets for all issues, analyze root causes, and assist in the efficient resolution of issues in all production processes. You will also maintain logs of all issues and ensure that resolutions meet quality assurance tests for all production processes. You need a good understanding of the business processes within the various systems the application uses, and you will need to be ambitious and willing to work outside your comfort zone.

You’ll enjoy the flexibility to work remotely from anywhere within the U.S. as you take on some tough challenges. For all hires in the Minneapolis or Washington, D.C. area, you will be required to work in the office for a minimum of four days per week.

Requirements

Bachelor’s degree in CS or an IT-related field
3+ years of experience with AWS Cloud SDKs using Java (Spring Boot microservices), Scala, and Python
3+ years of experience with distributed data services (DynamoDB, Athena, or similar)
3+ years of experience with AWS Cloud: S3, CloudWatch, ECS, Lambda, RDS, EMR
3+ years of experience with CI/CD using GitHub Actions or similar

Nice To Haves

Experience with Unix, Hadoop, HBase, and Hive
Experience working with offshore and onsite teams
Proven communication skills
3+ years of experience with Elastic APM
3 years of experience with Scala
3 years of experience with Kubernetes clusters

Responsibilities

Lead and mentor a team of SREs to ensure high-quality delivery and professional growth
Design, build, and maintain scalable and reliable systems using cloud-native technologies
Develop and implement monitoring, alerting, and observability strategies to ensure optimal system performance and user experience
Automate operational tasks and drive infrastructure-as-code (IaC) adoption
Proactively identify and resolve reliability risks, bottlenecks, and performance issues, leveraging AI
Collaborate with engineering and product teams on architecture, code reviews, and incident response
Lead post-incident reviews (blameless retrospectives), root cause analysis, and continuous improvement initiatives
Streamline migration processes, ensure consistency, and enhance efficiency through automation, AI, and innovative solutions
Define SLOs/SLIs, track error budgets, and report on system health to stakeholders
Ensure compliance and security standards are integrated into system operations
Stay current with emerging technologies and SRE best practices
Leverage enterprise-approved AI tools to streamline workflows, automate tasks, and drive continuous improvement