Senior Site Reliability Engineer - Product Reliability

Matador•Laval, QC

9d•CA$130,000 - CA$150,000•Remote

About The Position

We're seeking a Senior Site Reliability Engineer – Product Reliability to help scale, operate, and improve the reliability of our AI-powered communication platform. This role sits at the intersection of software engineering, infrastructure, operations, and product support. You'll be responsible for ensuring the stability, scalability, and performance of systems powering thousands of real-time interactions across distributed, event-driven architectures. You'll also serve as the first layer of technical investigation for production issues and product-related failures, partnering closely with engineering teams to identify root causes, improve observability, and drive long-term reliability improvements. This is a highly technical, hands-on role for someone who enjoys debugging complex systems, improving operational excellence, and building reliable infrastructure at scale.

Requirements

5+ years of experience in Site Reliability Engineering, Production Engineering, Backend Engineering, or related roles
Strong hands-on experience with Node.js and TypeScript in production environments
Proven experience operating and troubleshooting distributed systems and microservices architectures
Experience managing production workloads on AWS, including ECS, Lambda, SQS, and API Gateway
Hands-on experience with Kafka, AWS SQS, or other messaging/event-streaming systems
Strong understanding of observability, monitoring, alerting, and incident response best practices
Experience debugging complex production issues across application, infrastructure, and networking layers
Deep understanding of system reliability concepts including concurrency, async workflows, resiliency, fault tolerance, and eventual consistency
Experience with MongoDB and Redis in high-scale production environments
Ability to analyze logs, traces, metrics, and system behavior to identify root causes efficiently
Strong communication skills and ability to collaborate across engineering, product, and support teams
Experience mentoring engineers and contributing to operational excellence initiatives

Nice To Haves

Experience with Kubernetes and container orchestration in production
Broader AWS infrastructure experience (networking, infrastructure-as-code, observability, cost optimization)
Experience with relational databases such as PostgreSQL
Experience developing load tests, resilience tests, and chaos engineering exercises
Prior customer support experience or direct work with customers to understand business impact

Responsibilities

Serve as the first line of technical investigation for production incidents, product failures, and performance issues
Analyze logs, traces, metrics, and system behavior to identify root causes efficiently and implement solutions
Partner closely with backend engineering and DevOps teams to diagnose issues impacting stability, latency, and reliability
Design and implement observability improvements, including monitoring, alerting, and structured logging across distributed systems
Establish and improve incident response processes, including escalation procedures, post-mortem analysis, and prevention of recurring incidents
Participate in architectural design of backend services, event-driven systems, and asynchronous messaging pipelines to ensure reliability and support disaster recovery
Optimize performance and resilience of systems operating under high load, powering thousands of real-time interactions
Develop and maintain operational documentation, runbooks, and dashboards to support production operations
Collaborate with product and customer support teams to understand business impact and inform prioritization
Mentor junior engineers on reliability best practices and resilient design principles