Manager, Platform & Site Reliability

CIRA•Ottawa, ON

CA$135,000 - CA$150,000•Hybrid

About The Position

Join the team that is building a trusted internet for Canadians! CIRA may be best known for managing the .CA domain, but our impact reaches far beyond that. We’re at the forefront of advancing cybersecurity technologies and leading projects that improve the digital experience for users across Canada and the world. Our broad scope of activities is driven by one central goal: to strengthen and secure Canada’s digital landscape. By working with the CIRA registry team, you’ll play a part in advancing the CIRA Registry Platform, which supports a wide range of domains globally. Help us drive innovation and maintain the high standards of stability and security that our platform is known for. Join us in advancing digital identity and technology in Canada and beyond.

Requirements

7+ years of progressive experience in Site Reliability Engineering (SRE), platform engineering, DevOps, infrastructure, or cloud operations, including hands-on experience with public cloud platforms such as AWS.
3+ years of experience leading, coaching, and developing technical teams in SRE, platform engineering, DevOps, infrastructure, or cloud operations.
Demonstrated success building and developing high-performing engineering teams through mentoring, coaching, performance management, and a culture of continuous learning and accountability.
Experience defining technical strategy, influencing cross-functional stakeholders, and balancing reliability, security, operational excellence, and business priorities.
Strong hands-on background with public cloud platforms, preferably AWS, including cloud-native architecture, networking, security, resilience, scalability, and cost-aware operations.
Experience leading teams that implement and operate infrastructure as code (IaC), GitOps, and automation practices to manage cloud infrastructure, platform services, and deployment workflows.
Strong understanding of CI/CD principles, release automation, and modern software delivery practices.
Experience with containerization and orchestration technologies such as Docker and Kubernetes.
Experience with observability platforms, monitoring frameworks, incident management practices, and operational analytics tools.
Demonstrated experience defining and implementing SLOs, SLIs, error budgets, production readiness standards, and incident response processes.
Strong understanding of disaster recovery, business continuity, backup and recovery strategies, and resilience testing.
Experience supporting highly available, mission-critical, or regulated technology platforms where reliability, security, and operational discipline are essential.
Exceptional communication, collaboration, and stakeholder management skills, with the ability to translate complex technical concepts into clear business outcomes for both technical and non-technical audiences.

Responsibilities

Lead, coach, and develop a high-performing team of SRE and Platform Specialists responsible for the reliability, scalability, security, and operational excellence of CIRA's registry platforms and supporting technology services.
Define and execute the platform and site reliability strategy, aligning priorities and investments with organizational objectives and customer needs.
Define and mature SRE practices, including Service Level Objectives (SLOs), Service Level Indicators (SLIs), error budgets, production readiness standards, and operational acceptance criteria for mission-critical registry services.
Drive the design, operation, and continuous improvement of scalable, resilient, cloud-native platforms using public cloud technologies such as AWS.
Champion automation, infrastructure as code, GitOps, CI/CD, and self-service platform capabilities to reduce manual effort, operational toil, and engineering bottlenecks.
Establish and continuously improve observability, monitoring, alerting, and dashboarding practices to provide clear visibility into platform health, service reliability, and customer-impacting issues.
Lead incident management for high-severity events, providing incident command, stakeholder communication, and root cause analysis, and driving follow-up actions that strengthen long-term platform resilience.
Collaborate with engineering, security, support, compliance, and business stakeholders to establish priorities, balance risk, and deliver platform improvements that support registry operations and organizational goals.
Drive performance engineering, capacity planning, disaster recovery testing, and resilience validation to ensure the ongoing reliability and availability of critical registry platforms and related services.
Foster a culture of ownership, accountability, continuous learning, operational excellence, and psychological safety that empowers the team to innovate and perform at their best.