Site Reliability Engineer (SRE)

mthree Recruiting Portal•Montreal, QC

26d•Onsite

About The Position

We are looking for someone to join a dynamic team as a Site Reliability Engineer for one of our clients. Site Reliability Engineering (SRE) is a production-oriented discipline focused on improving the availability, observability, scalability, performance, and reliability of technology products by applying sound software engineering principles and adopting the latest technology and tooling.

We would like to talk to you if you:
Are interested in distributed systems and in working with highly scalable, reliable services.
Like to work in a fast-moving environment and aren't afraid to change things to make them better.
Enjoy new technological challenges and solving hard problems.
Believe that a team working well together is truly smarter than the single smartest person on that team.
Aspire to grow as a person, as a teammate, and as an engineer.
Have grit, drive, and a deep sense of ownership.

Requirements

Background in Computer Science equivalent to a B.Sc. Equivalent practical experience is a reasonable substitute.
Must have experience with Kubernetes and the management of containerized applications.
Automation experience using scripting languages such as Python, Bash, or Perl is particularly valued. Experience with at least one higher-level language is desired.
Experience supporting three-tier architectures, including exposure to UNIX and Linux platforms and databases such as IBM DB2, Sybase, MongoDB, and Greenplum.
Experience with source code and binary repositories, build tools, and CI/CD (Git, Artifactory, Jenkins, Docker), and with data streaming technologies such as Spark and Kafka.
Hands-on experience with enterprise tool sets such as Grafana, Dynatrace, and AppDynamics.
Awareness of, and ability to reason about, modern software and systems architectures, including load balancing, queueing, caching, distributed systems failure modes, and microservices.
Deep understanding of operating-system-level concepts such as processes, memory allocation, and the network stack; understanding of how applications are affected by them, and the ability to debug related issues.
Practical experience running large-scale online systems is always an advantage.

Responsibilities

Work closely with engineering and development teams to design, build, and maintain systems, and help them decide on products to use, schema design, and query tuning.
Troubleshoot issues across the entire stack: hardware, software, application, and network.
Identify and drive opportunities to improve automation for our platforms; scope and create automation for the deployment, management, and visibility of our services.
Proactively identify and address systems reliability risks.
Represent the RPE organization in design reviews and operational readiness exercises for new and existing services.
Work alongside existing global and regional team members on a follow-the-sun basis.
Participate in the on-call rotation and in periodic conference calls with specialists in other time zones.
