Senior Cloud Site Reliability Engineer

State of Wisconsin Investment Board•Madison, WI

Hybrid

About The Position

We are seeking a highly skilled and experienced Senior Site Reliability Engineer to oversee the build and transformation of SWIB’s cloud-native technology stack. This role will serve in a critical capacity to ensure that all aspects of the Software Development Lifecycle (SDLC), across software, data, and infrastructure, are built from the ground up with modern tools and techniques. The ideal candidate will have a strong background in financial services, exceptional leadership skills, and the ability to effectively manage platforms that require continuity on a 24x6 basis. This role will serve as a thought leader within the technology organization, helping to drive change and transformation across all teams. The Senior Site Reliability Engineer will partner with cloud-native application development teams that have direct business alignment, empowering them to deliver quality software at high velocity, rapidly receive actionable feedback, and cultivate an environment of continuous experimentation. Additionally, the Senior SRE will create reusable components and workflows that deliver a superior engineering experience.

Requirements

Enablement mindset and attitude – you win when teams win
Excellent verbal and written communication skills
Bachelor’s Degree in Computer Science, or a related field, or equivalent work experience
8+ years of professional Site Reliability Engineering experience (or equivalent demonstrated impact)
Strong background in designing, implementing, and delivering complex technical architectures
Hands-on experience with the Cloud in a production environment (AWS preferred)
Solid experience implementing Infrastructure-as-Code (Terraform or OpenTofu preferred)
Hands-on experience building and running CI/CD infrastructure (GitLab preferred)
Hands-on development experience in a modern programming language (Python preferred)
Experience implementing and operating container orchestration platforms such as Kubernetes, Amazon EKS, or Amazon Elastic Container Service (ECS)
Deep understanding of information security concepts
Experience implementing and operating monitoring tools such as Sentry, Prometheus, and Datadog
Experience with or strong interest in AI technologies
Familiarity with version control systems such as Git
Working experience with agile methodologies
Ability to work under pressure and manage multiple priorities in a fast-paced environment

Nice To Haves

Experience with cloud-performant microservices and event-driven architectures

Responsibilities

Function as subject matter expert in the areas of delivering and operating reliable, performant, robust, and secure applications, with an emphasis on multi-region and multi-cloud patterns.
Work with the development teams to design, document, create and maintain highly available systems.
Design, implement, and manage centralized monitoring solutions that provide expedient actionable feedback to the development teams.
Partner with the application development teams to create flows, processes, automation, and tooling.
Facilitate the evaluation, adoption, and integration of AI into the software delivery platform.
Create an environment of continuous experimentation and learning.
Contribute to the evolution of our cloud-focused architecture to increase its flexibility and ease of use.
Follow technology trends/tools and recommend improvements to our technology when appropriate.
Mentor new or less senior members of the team.
Share experience, knowledge, and ideas with the team to improve processes and productivity.
Provide tier 2 and 3 escalations for related issues and questions.
Establish and monitor key performance indicators (KPIs) and service level agreements (SLAs) to ensure the support team meets or exceeds performance expectations.
Conduct regular performance reviews and provide ongoing training and development opportunities for the support team.
Drive continuous improvement initiatives to enhance support processes, reduce incidents, and improve overall application reliability and user satisfaction.
Manage vendor relationships and ensure third-party support services align with organizational needs and standards.
Maintain comprehensive documentation of support processes, incidents, and resolutions.
Stay current with industry trends and emerging technologies to ensure the SRE function remains cutting-edge and effective.

Benefits

Competitive total cash compensation, based on AON (formerly McLagan) industry benchmarks
Comprehensive benefits package
Educational and training opportunities
Tuition reimbursement
Challenging work in a professional environment
Hybrid work environment
