Manager, Site Reliability Engineering and DevOps

Moneris•Toronto, ON

7d•CA$142,000 - CA$186,000•Hybrid

About The Position

The Manager, Site Reliability Engineering (SRE) leads teams and practices that ensure the availability, performance, and resiliency of Moneris’ critical platforms and services. You will oversee SRE delivery across assigned domains while embedding reliability principles throughout the service lifecycle. In this role, you will lead a team of site reliability engineers supporting production systems across cloud and on‑premises environments, driving service reliability through Service Level Objectives (SLOs), observability, and automation. You will partner closely with Development, DevOps, Infrastructure, and Security teams to reduce operational toil, improve incident response, and scale reliability practices. This is a high‑impact leadership role where you will shape SRE maturity, influence engineering standards, and ensure consistent, measurable improvements in system health and operational excellence.

Requirements

8+ years of experience in senior technical roles supporting distributed systems.
3+ years of experience leading and developing technical teams.
Strong knowledge of Site Reliability Engineering principles, including SLOs, SLIs, and error budgets.
Hands‑on experience with cloud platforms (Azure preferred), Kubernetes, infrastructure as code, and automation.
Experience with enterprise observability tools such as Dynatrace, Datadog, New Relic, or AppDynamics.
Strong scripting or programming skills in support of automation and operational efficiency.
Solid understanding of Linux‑based systems and production infrastructure.

Nice To Haves

Experience in payment processing, fintech, or PCI‑regulated environments.
Familiarity with change management and compliance frameworks.
Understanding of Software Development Life Cycle (SDLC) and modern delivery practices.
Bachelor’s degree in Computer Science, Software Engineering, or equivalent experience.

Responsibilities

Lead and manage site reliability engineers supporting the reliability, availability, and performance of business‑critical applications and platforms.
Implement and operationalize SRE practices, including SLIs, SLOs, error budgets, incident response, and post‑incident reviews.
Oversee production operations, including on‑call rotations, incident management, escalations, and problem management.
Partner with Development and DevOps teams to embed reliability principles into system design and delivery pipelines.
Drive observability strategy through monitoring, logging, and alerting standards across services.
Reduce operational toil through automation and self‑healing system design.
Lead capacity planning, resiliency testing, and disaster recovery readiness initiatives.
Recruit, mentor, and develop SRE talent while fostering a culture of continuous improvement and accountability.

Benefits

Hybrid work arrangement balancing in-office collaboration with remote flexibility.
Total compensation may also include variable or discretionary incentive components, including but not limited to bonuses and commissions.
