Engineer III

MAPFRE•Webster, MA

Onsite

About The Position

The Site Reliability Engineer (SRE) is a critical part of the MAPFRE USA (MUSA) on-prem and cloud platform strategy. In this role, you will ensure that MUSA’s development platform and processes let our software engineers focus more on innovation than on infrastructure. You will drive the adoption of observability best practices and develop automations that resolve recurring issues. You must be comfortable working with software engineering teams and supporting their demanding needs to ensure the security, availability, and performance of the platform. You must be capable of triaging issues on the front line as well as framing strategic initiatives from leadership. Hands-on keyboard work is a must for this role, with a focus on developing reliability engineering for MUSA platforms.

Requirements

6 or more years of work experience with a Bachelor’s degree, or 4 or more years of relevant experience with an advanced degree.
Hands-on experience with Linux and Windows systems and a good understanding of distributed computing environments.
Intermediate-level programming and/or scripting in 3 or more of the following: Python, PowerShell, JavaScript, Terraform, Ansible, etc.
2+ years of experience managing CI/CD tooling such as Jenkins, GitHub, Bitbucket, DevOps in a large-scale environment.
3+ years of experience managing observability tooling such as Splunk, Dynatrace, etc. in a large-scale environment.
Advanced understanding of YAML, JSON, HTML, XML.
2+ years of work experience supporting relational and non-relational databases (MySQL, MongoDB, PostgreSQL, etc.), including creating and running queries and managing performance and scaling.
3 or more years leading a Platform, SRE or Production Engineering group for high availability/critical platforms/applications.
Experience managing a distributed platform including but not limited to deployment/release management, provisioning, capacity management, workload management.
A site reliability engineer uses a service to monitor performance metrics and detect anomalous application behavior. If there are issues with the application, the SRE team submits a report to the software engineering team. The developers fix the reported cases and publish the updated application.

Nice To Haves

Master’s degree in IT, CS, or a related field preferred and/or 5+ years of relevant work experience.

Responsibilities

Set standards for the monitoring of MUSA on-prem and Cloud infrastructure and applications.
Ensure the platform target SLAs are met and implement appropriate SLIs for supporting services.
As a key member of the Critical Incident Response team, use expert communication and troubleshooting skills to help the team reach an efficient resolution.
Work with developers during service transition, evaluating reliability and operability of the applications and ensuring adequate monitoring, alerting and observability.
Partner with peers within Operations & Infrastructure supporting ongoing maintenance and enhancement of the platform.
Focus on setting standards for automating routine tasks and workflows in supporting Infrastructure and Engineering teams.
Support multiple internal stakeholders with a variety of technical challenges.
Analyze and discern patterns in the variety of issues that arise and propose solutions to these problems.
Work in a 24/7/365 operation model, which will require working in a shift or on-call support model.