SRE specialist

Intact FC•Montreal, QC

21h•Hybrid

About The Position

We are seeking a hands-on Site Reliability Engineer for the SRE & Resiliency team within the Intelligent Operations Department. The role operates across Azure, AWS, GCP, and on‑prem environments and is embedded in the broader enterprise resiliency and production reliability strategy. The SRE will work as part of a special investigations unit that supports Applicative Support, Infrastructure Support, and the Incident Management team by coaching, guiding, and leading investigations into active incidents and by driving proactive reliability improvements. Core responsibilities include deep investigations, advanced observability (OpenTelemetry, Dynatrace, Elastic), auto-healing tooling, SLI/SLO stewardship, and business-aligned reliability reporting.

Requirements

8+ years of experience in SRE/Platform/Infrastructure/Software Engineering operating large-scale production systems across multi-cloud and on‑prem.
Strong proficiency in:
Observability: OpenTelemetry instrumentation and standards; Dynatrace (Davis AI, Smartscape, service-level analysis, baselining); Elastic/ELK (Beats/Agent, ingest pipelines, ILM, Kibana).
Reliability engineering: SLIs/SLOs/SLAs, error budgets, alert strategy, capacity modeling, graceful degradation, circuit breaking, retries/backoff.
CI/CD and deployment patterns: blue/green, canary, progressive delivery, automated rollback, pipeline safeguards.
Kubernetes and service meshes; platform-level resilience and operability.
Data and event systems: replication, snapshots/PITR, CDC, streaming (Kafka, RabbitMQ, Pub/Sub) with DLQs/reprocessing; dependency risk management.
Networking and traffic: DNS, load balancers, CDN/edge, TLS/mTLS; fundamentals of BGP and global traffic management.
Solid software engineering skills in at least one of: Go, Python, or TypeScript; experience with IaC (Terraform), GitOps (Argo CD/Flux), and policy-as-code.
Experience running chaos engineering, game days, and DR exercises; ability to design safe experiments and embed learnings into production hardening.
Excellent communication (written, visual, verbal); adept at coaching, leading investigations, and presenting to mixed technical/business audiences.
Bilingual (French and English): the role requires regular interaction with English-speaking clients and colleagues across the country.
Must be eligible to work in Canada.

Nice To Haves

No Canadian work experience required.

Responsibilities

Lead high‑severity investigations and RCA with App/Infra/Incident teams.
Proactively find systemic risks and resilience gaps; drive durable fixes.
Run blameless post‑mortems and coach teams.
Implement end‑to‑end traces/metrics/logs with consistent semantics.
Build insights and anomaly detection; create topology‑aware health models.
Integrate synthetics, contract tests, and distributed tracing.
Build policy‑driven remediation (circuit breakers, throttling, retries).
Enable progressive delivery (blue/green, canary) with safe rollbacks.
Provide resilience tooling: validation, safeguards, chaos, DR, runbooks.
Define user‑centric SLIs/SLOs; enforce error budget policies.
Publish reliability reports and scorecards; drive continuous improvement.
Upskill support/incident teams; standardize playbooks and training.
Promote an automation‑first, data‑driven resilience culture.
Operate across Azure/AWS/GCP/on‑prem, covering global load balancing (GLB), DNS, TLS, CDN, and failover.
Improve K8s/mesh (AKS/EKS/GKE, Istio/Linkerd) and data/streaming resilience.
Use AI for causal analysis and anomaly detection to cut MTTR.
Develop reliability copilots; monitor AI systems for reliability and cost.

Benefits

Flexible work arrangements and a hybrid work model
Possibility to purchase up to 5 extra days off per year
Multiple benefits offered to support physical and mental wellbeing, including telemedicine, Wellness account and much more
Share plan & other savings: up to 12% of salary or even more (ask how you could earn guaranteed income for life)
Annual bonus target of 15% of base salary, with a potential payout of up to double the target (subject to personal and company performance)
Employee Share Purchase Plan (ESPP) – with Intact matching 50% of your net shares.
Defined benefit pension plan
