We are seeking a highly skilled site reliability engineer (SRE) to join our IT Operations team. The SRE role is responsible for enabling innovation and velocity of change while ensuring system reliability, focusing on the critical features and functionality within products and platforms. The SRE collaborates with business or product owners to prioritize operational requirements by defining service-level indicators (SLIs) and service-level objectives (SLOs) that monitor and optimize the customer's journey and experience. Our goal is to improve the stability of existing platforms and, in parallel, to design and operate scalable, resilient systems using modern software engineering principles.
In this role, you will analyze service management data (incident, problem, change, and release management) to identify persistent problems. You will then improve monitoring and observability and implement corrective actions. You are also encouraged to recommend changes to our architecture to increase performance and stability. Successful reliability outcomes are likely to build on and extend DevOps and Agile ways of working and the associated automation approaches. These are underpinned by the site reliability engineer’s solid understanding of systems, production environments, operational insights, incident management, and on-premises, cloud, and hybrid environments.
The nature of the work means that the site reliability engineer will engage directly with customer teams but will also work on reliability initiatives that span multiple teams. The site reliability engineer collaborates closely with product owners and teams, architects, IT service management, software developers, security and network engineers, as well as other subject matter experts and roles, particularly in infrastructure and operations.
Being an approachable team player and a good communicator is therefore crucial for success, and a willingness to lead initiatives is important. The site reliability engineer leads root cause analysis in areas such as deployment activities, event management, incident and problem management, availability, capacity and service-level management, as well as service continuity and scalability.
Job Type
Full-time
Career Level
Mid Level