Site Reliability Engineer

Scotiabank•Toronto, ON

29d•Onsite

About The Position

We’re looking for an SRE with deep experience in production observability and incident response to raise the reliability and transparency of our customer-facing services. You will own the end-to-end observability stack across Dynatrace, Splunk, Power BI, and Google Cloud (GCP) Monitoring, drive proactive detection and toil reduction, and lead major incident response. This role focuses on operational excellence and service health, not platform engineering or DevOps provisioning. Note: This role does NOT manage CI/CD, infrastructure provisioning, or platform build (Terraform/Kubernetes cluster ops). Collaboration with those teams is expected, but ownership stays with monitoring, analytics, incident response, and reliability outcomes.

Requirements

5+ years in SRE/Production Operations/Observability with Dynatrace and Splunk in high-availability environments.
Hands-on with GCP operations: Cloud Monitoring, Cloud Logging, Alerting Policies, Uptime Checks, SLOs/SLIs; familiarity with Error Reporting/Trace is a plus.
Strong SPL (Splunk) and Dynatrace (APM/RUM/Synthetic) expertise, including alert design, dashboards, and noise reduction.
Power BI proficiency: data modeling, DAX measures, row-level security, and scheduled refresh for operational and executive reporting.
Proven incident commander experience for Sev1/Sev2 with clear comms, stakeholder management, and PIR facilitation.
Coding/scripting for automation and data manipulation (e.g., Python or PowerShell; Go/Bash a plus).
Solid understanding of service reliability concepts: golden signals, SLOs/error budgets, capacity and saturation, graceful degradation.
Strong analytical mindset with a bias to measurable outcomes (MTTD/MTTR, alert volume, SLO compliance).

Nice To Haves

Familiarity with GCP Error Reporting and Trace.
Go or Bash scripting.

Responsibilities

Design and maintain end-to-end monitoring for critical services using Dynatrace (APM, Real User Monitoring, Synthetic, Davis AI, Smartscape) and GCP Cloud Monitoring (metrics, alerting policies, SLOs/SLIs, uptime checks, dashboards).
Build service maps, dependency models, and problem detection in Dynatrace; tune Davis AI problem rules and reduce alert noise through thresholds, baselining, and tagging.
Implement SLOs/SLIs with error budgets; continuously review burn rates and align alerting to customer impact.
Partner with application teams to instrument code paths (e.g., Dynatrace OneAgent), trace distributed transactions, and validate golden signals (latency, traffic, errors, saturation).
Create and optimize Splunk data models, indexes, sourcetypes, ingestion pipelines, and SPL searches; build actionable dashboards for NOC/SRE/Engineering.
Develop operational analytics and executive reporting in Power BI (data modeling, DAX/Measures, scheduled refresh) to track reliability KPIs, incident trends, MTTR/MTTD, SLO compliance, and capacity signals.
Establish governance for data quality, field extractions, and retention to ensure fast, accurate investigations.
Lead incident response (Sev1/Sev2): run bridges, coordinate SMEs, communicate status/timelines, drive mitigation and customer updates.
Maintain runbooks, decision trees, and standard operating procedures; ensure blameless post-incident reviews (PIRs) with clear RCA, corrective actions, and preventative measures.
Track and close problem tickets tied to recurring failure modes; verify effectiveness of fixes via metrics and error budgets.
Use light coding/scripting to automate recurring tasks: alert tuning, data enrichment, log parsing, playbook triggers, service health checks.
Build small utilities or bots for on-call workflows (e.g., auto-triage, context gathering, incident timelines).
Contribute to observability standards and best practices (naming, tags, SLIs, alert policies), and mentor teams on instrumenting for reliability.

Benefits

Diversity, Equity, Inclusion & Allyship
Accessibility and Workplace Accommodations
Upskilling through online courses, cross-functional development opportunities, and tuition assistance.
Competitive Rewards program including bonus, flexible vacation, personal and sick days, and benefits that start on day one.
Free tea & coffee, universal washrooms, and lots of space for team collaboration.
Opportunities for community engagement & belonging with our various programs.
