Sr. Site Reliability Engineer, IE&O

Toronto, ON

CA$102,700 - CA$137,000 • Onsite

About The Position

The Site Reliability Engineer will help improve the reliability, availability, performance, and operability of McCain’s critical technology platforms. This role will design resilient cloud-native systems, embed observability into applications and infrastructure, automate operational workflows, and help scale SRE practices across engineering and platform teams. As part of the global SRE practice, this role will also help build and drive McCain’s AIOps capabilities by improving telemetry, alert quality, event correlation, incident automation, and proactive reliability insights.

Requirements

9+ years of experience in software engineering, platform engineering, cloud engineering, DevOps, production engineering, or site reliability engineering.
Strong hands-on experience with Azure, Kubernetes, containers, APIs, distributed systems, and modern deployment patterns.
Strong scripting or software engineering experience using Python, Go, PowerShell, Bash, Java, or similar languages.
Experience with observability, including metrics, logs, traces, dashboards, alerts, OpenTelemetry, and telemetry-driven reliability practices.
Experience with Infrastructure as Code, CI/CD, automation, and deployment tooling such as Terraform, Bicep, GitHub Actions, Azure DevOps, or similar technologies.
Good understanding of SLOs, SLIs, error budgets, resiliency patterns, incident management, production readiness, and capacity planning.
Strong troubleshooting, communication, and stakeholder-influencing skills.
Bachelor’s degree in Computer Science, Engineering, Information Technology, or a related technical field.
Azure certifications are preferred.

Nice To Haves

Experience with AIOps, event correlation, alert enrichment, noise reduction, automated triage, or incident automation.
Experience using AI-assisted capabilities for incident triage, root cause analysis, knowledge management, operational automation, or engineering productivity.
Experience building self-service platforms, reusable automation frameworks, golden paths, or internal developer platforms.

Responsibilities

Design, build, and improve reliable, scalable, and secure systems across Azure cloud and hybrid environments.
Embed observability into applications and platforms using metrics, logs, traces, dashboards, alerts, and OpenTelemetry standards.
Build and drive AIOps capabilities by improving alert quality, event correlation, incident enrichment, noise reduction, automated triage, and operational automation.
Partner with engineering teams to define SLOs, SLIs, error budgets, production readiness standards, and reliability scorecards.
Build automation to reduce toil across infrastructure, deployments, incident response, monitoring, and operational workflows.
Use Infrastructure as Code, CI/CD pipelines, scripting, and self-healing patterns to improve reliability and delivery speed.
Support incident response, root cause analysis, postmortems, escalation workflows, and continuous reliability improvements.
Troubleshoot complex issues across application, infrastructure, cloud, network, database, and integration layers.
Build reusable SRE playbooks, standards, templates, and automation patterns for broader enterprise adoption.
Collaborate with developers, platform teams, operations teams, vendors, and stakeholders to improve system reliability and operational maturity.