Site Reliability Engineer

TobogganLabs•Montreal, QC

Remote

About The Position

We're seeking a Site Reliability Engineer (SRE) to help our clients build reliable, observable, and secure production systems. In this role, you will work closely with client engineering and operations teams to improve system reliability, reduce toil, and build the operational foundations (deployment pipelines, monitoring, incident management, and infrastructure) that keep production systems running smoothly. We specialize in healthcare and other regulated industries, but not all of our projects are in these fields, so you may occasionally work in other domains.

Requirements

Have 5+ years of experience in infrastructure, DevOps, or site reliability engineering
Have hands-on experience with AWS or Azure infrastructure and infrastructure-as-code tools (Terraform, CloudFormation, or equivalents)
Have strong experience with CI/CD pipelines (GitHub Actions, ArgoCD, Jenkins, or equivalents) and deployment automation
Have experience with observability tools (Prometheus, Grafana, Datadog, CloudWatch, or equivalents) and incident management processes
Are familiar with security best practices for cloud infrastructure, including network security, IAM, encryption, and vulnerability management
Have excellent communication skills and can explain infrastructure and reliability concepts to varied stakeholders
Are adaptable, self-directed, and comfortable in dynamic client environments
Can explain reliability and security trade-offs and connect them to business needs

Nice To Haves

Experience in client-facing roles such as consulting, implementation engineering, or advisory work.
Experience working in healthcare or other heavily regulated industries.
Software development experience beyond scripting, such as building features, APIs, or applications.
Experience with container orchestration (Kubernetes, ECS) and cloud-native tooling.
Experience building infrastructure automation with scripting languages (Python, Bash) or workflow tools.
Relevant certifications (AWS DevOps Professional, AWS Solutions Architect, CKA, or similar).

Responsibilities

Design and maintain resilient, secure cloud infrastructure using infrastructure-as-code; implement security controls, hardening standards, and compliance guardrails across client environments.
Design and implement monitoring, alerting, and logging systems; lead incident response and post-mortem processes; define and track SLOs and SLIs.
Automate deployment pipelines, infrastructure provisioning, and operational runbooks to reduce toil and improve system resilience.
Own the reliability and infrastructure workstream, guide client engineering teams on SRE practices, and contribute to architectural decisions.
Share SRE expertise with colleagues, contribute to internal tooling and documentation, mentor team members, and participate in the broader Toboggan community.

Benefits

Home office/technology budget
Yearly professional development budget
Company-matched RRSP contributions after 1 year
100% employer-paid health and dental insurance, including a yearly bank of coverage for complementary medicine (acupuncture, osteopathy, massage therapy, naturopathy, psychology, etc.)
Life insurance, plus long- and short-term disability insurance
Parental leave top-up (8 weeks), available to employees with 1+ year of tenure, regardless of path to parenthood
