Practice Technical Manager

Datavail • Canada

About The Position

This role leads the Site Reliability Engineering (SRE) practice, with a focus on daily operations, incident management, and technical oversight to ensure system resilience and reliability. The Practice Technical Manager also provides cross-functional leadership, translating business priorities into reliability roadmaps and supporting customer-facing discussions.

The core responsibilities fall into three main areas: Team Leadership & Operational Management, Technical Oversight, and Cross-Functional Leadership.

Team Leadership & Operational Management covers running daily operations, maintaining a healthy on-call program, overseeing incident management, establishing operational KPIs, coaching SREs, and keeping documentation current.

Technical Oversight covers providing architecture-level guidance on resilience and observability, validating SLIs/SLOs, reviewing reliability design work, participating in high-severity incidents, and ensuring engineering quality for IaC, CI/CD, and Kubernetes operations.

Cross-Functional Leadership covers acting as the primary point of contact for internal stakeholders, translating business priorities into reliability roadmaps, aligning teams around shared reliability objectives, and supporting customer-facing conversations.

Requirements

6–10 years in SRE/Operations/Platform roles, with at least 2 years leading or managing engineers.
Hands-on technical background across cloud platforms (AWS/Azure/GCP) and Kubernetes.
Experience defining and operating SLIs/SLOs, incident response, and postmortem programs.
Strong grounding in Terraform or similar IaC, CI/CD systems, and observability technologies (Prometheus, Grafana, OpenTelemetry, ELK).
Ability to assess technical work, coach engineers through complex problems, and make informed trade-offs under pressure.
Excellent operational judgment: triage, prioritization, team load balancing, and process design.
Professional-level cloud provider certification in AWS (Solutions Architect), Azure (Solutions Architect Expert), GCP (Professional Cloud Architect), or Oracle Cloud (Architect Professional).

Nice To Haves

Prior experience running a distributed or follow-the-sun SRE practice.
Exposure to chaos engineering, fault injection, or reliability stress testing.
Familiarity with cloud cost governance and rightsizing strategies.
Experience improving or scaling on-call systems.

Responsibilities

Run the daily operations of the SRE practice: team planning, shift assignments, escalation routing, and workload balancing.
Maintain a healthy on-call program: define rotation rules, track fatigue, ensure coverage, and continuously improve response maturity.
Oversee incident management processes—ensuring consistent triage, high-quality postmortems, and follow-through on remediation work.
Establish operational KPIs for the team (MTTA, MTTR, on-call load, ticket aging, toil reduction) and drive accountability.
Coach and develop SREs at all levels through 1:1s, technical guidance, and structured growth plans.
Ensure the team’s processes, documentation, and runbooks stay current and are audited.
Provide architecture-level guidance on resilience, observability, and reliability patterns; step in directly when the team is blocked or customer-impacting work demands senior technical judgment.
Validate SLIs/SLOs and error budgets across services; ensure consistent implementation and reporting.
Review and approve reliability design work—monitoring strategies, automation initiatives, CI/CD changes, deployment safety controls, and cloud cost/performance optimizations.
Participate in high-severity incidents as the escalation point and technical lead when needed.
Ensure engineering quality for IaC, CI/CD, observability instrumentation, and Kubernetes platform operations.
Act as the primary point of contact for internal stakeholders (Dev, Product, Architecture, Cloud) regarding reliability strategy and prioritization.
Translate business priorities into reliability roadmaps, staffing plans, and operational improvements.
Align teams around shared reliability objectives—ensuring corrective actions, automation priorities, and capacity planning are actually executed.
Support customer-facing conversations when reliability posture, operational processes, or technical improvements require leadership representation.