DevOps Engineer - BTP Site Reliability Engineering Team

SAP•Montreal, QC

Hybrid

About The Position

We help the world run better. At SAP, we make it simple: you bring your best self, and we'll bring out your best. We are builders who touch over 20 industries and 80% of the world's commerce, and we need your unique talents to shape the future. The work is demanding, but it has meaning. You'll find a place here where you can be yourself, where your well-being is a priority, and where you'll truly belong. What does this bring you? Constant learning, skill development, excellent benefits, and a team that wants to see you grow and succeed.

Important information: This is a hybrid role based at our SAP Montreal office and requires three in-office days per week. Candidates must be legally authorized to work in Canada at the time of application. This position is not eligible for employer sponsorship (e.g., LMIA or other immigration support).

As a Site Reliability Engineer, you will operate and support critical services for SAP and its customers. In your daily work, you will proactively monitor service behavior and identify opportunities for improvement. You will help develop monitoring and troubleshooting tools for cloud services, built on the latest open-source and SAP technologies and following Site Reliability Engineering principles.

Requirements

Bachelor's degree in Computer Science or a related technical field.
Experience with Kubernetes and a good understanding of containerization technologies.
Understanding of modern cloud architectures (experience with cloud platforms such as AWS, Azure, GCP is a plus).
Scripting and CI/CD skills (ArgoCD, Concourse, or GitHub Actions are a plus) and an enthusiasm for automation: making computers do the work for you.
Ability to work effectively in emergency situations, with an affinity for analyzing and resolving problems quickly within a global team.
Excellent team spirit, passionate about work, motivated and dynamic.
Excellent communication skills – precise and fact-based.
Fluent in English, basic French.
Functional fluency in English is essential for this position when based in Quebec.

Nice To Haves

Programming experience with Go, Python, Bash.
CKA/CKAD/CKS certifications.
Experience with Unix/Linux operating systems.
Experience with modern monitoring, logging, and alerting tools (Grafana, Prometheus, Kibana, Loki, Splunk On-Call, Dynatrace).
Security best practices for developing and operating cloud applications.
Participation in open-source projects.

Responsibilities

Act as a technical expert during incidents affecting our production services, investigating and resolving them at a deep technical level.
Conduct Root Cause Analyses (RCA) and follow up on improvement opportunities to prevent problems from recurring.
Perform in-depth investigations and log analysis to identify and resolve complex issues in accordance with Service Level Agreements (SLAs).
Design software solutions to improve the reliability and stability of services.
Improve infrastructure and platform monitoring by collecting system metrics (the four golden signals: latency, traffic, errors, and saturation) and implementing tools to aid in service recovery.
Work closely with development teams to implement improvements identified during post-mortems.
Stay current with new technologies.
Create and maintain technical documentation.
Define, promote, and apply Site Reliability Engineering best practices.
Participate in an on-call rotation to respond to alerts and prevent major incidents. On-call time is covered by a special compensation plan, and the team uses a 'follow the sun' approach.