Manager, Site Reliability Engineering (SRE)

Thomson Reuters•Toronto, ON

Hybrid

About The Position

As Site Reliability Engineering Manager, you will lead and mentor a team of SREs and drive the reliability, scalability, and performance of our cloud-based infrastructure, applications, and services.

Requirements

5+ years’ experience in a leadership role, managing a team of DevOps engineers and/or Site Reliability engineers or related technical professionals.
Bachelor’s degree or equivalent required, Computer Science or related technical degree preferred.
5-10 years of relevant experience in software development and/or technology platform, infrastructure, or operations.
Hands-on experience with programming and scripting languages.
Strong people management skills to effectively lead, motivate, and develop team members, including conducting performance evaluations and providing constructive feedback to drive continuous improvement and team success.
Experience using AI/ML tools to improve service and reduce costs, including work with AI-Operations (AIOps) solutions.
Experience with cloud technologies and services (e.g., AWS, Azure, GCP) and the use of their APIs.
Proficiency in DevOps practices and methodologies, with hands-on experience in CI/CD pipelines, configuration management, and Infrastructure as Code (IaC) tools such as Terraform and Bicep.
Familiarity with programming languages such as Python, Java, or C#.
Experience designing and supporting scalable systems and services.
Experience in leading release management processes, ensuring successful software releases by coordinating with cross-functional teams and overseeing the deployment, monitoring, and maintenance of new features and updates.
Proficiency in observability tools such as Datadog or New Relic.

Responsibilities

Team Leadership: Lead and mentor a team of SREs, providing technical guidance, coaching, and support to foster a culture of collaboration, innovation, and continuous improvement.
Strategic Vision and Planning: Develop and implement a strategic vision for the SRE team to align with organizational goals and drive continuous improvements in reliability and performance.
Performance Metrics and Reporting: Establish and monitor key performance indicators (KPIs) to measure the success of SRE initiatives and communicate results and insights to stakeholders.
Operational Excellence: Drive the implementation of best practices for reliability, scalability, and performance across our systems and services.
Risk Management: Proactively identify potential risks to service reliability and develop strategies to mitigate these risks, ensuring business continuity and resilience.
System Architecture: Collaborate with cross-functional teams to design, build, and maintain scalable and resilient architectures for our cloud-based infrastructure and applications. Identify opportunities for optimization and efficiency improvements. Solve intractable problems and devise solutions to improve the products and services we offer our customers.
DevOps Practices: Promote and implement DevOps principles and practices to streamline software delivery, automate infrastructure provisioning, and improve deployment processes. Collaborate with development teams to integrate SRE practices into the software development lifecycle.
Automation and Tooling: Champion the use of automation and tooling to streamline operational workflows, increase efficiency, and reduce manual toil. Drive the development of monitoring, alerting, and automation solutions to proactively identify and remediate issues.
Continuous Improvement: Promote a culture of continuous improvement by fostering innovation, experimentation, and learning within the team. Encourage knowledge sharing and professional development to enhance technical skills and expertise.

Benefits

Hybrid Work Model: We’ve adopted a flexible hybrid working environment (2-3 days a week in the office depending on the role) for our office-based roles while delivering a seamless experience that is digitally and physically connected.
Flexibility & Work-Life Balance: Flex My Way is a set of supportive workplace policies designed to help manage personal and professional responsibilities, whether caring for family, giving back to the community, or finding time to refresh and reset. This builds upon our flexible work arrangements, including work from anywhere for up to 8 weeks per year, empowering employees to achieve a better work-life balance.
Career Development and Growth: By fostering a culture of continuous learning and skill development, we prepare our talent to tackle tomorrow’s challenges and deliver real-world solutions. Our Grow My Way programming and skills-first approach ensures you have the tools and knowledge to grow, lead, and thrive in an AI-enabled future.
Industry Competitive Benefits: We offer comprehensive benefit plans that include flexible vacation, two company-wide Mental Health Days off, access to the Headspace app, retirement savings, tuition reimbursement, employee incentive programs, and resources for mental, physical, and financial wellbeing.
Culture: Globally recognized, award-winning reputation for inclusion and belonging, flexibility, work-life balance, and more. We live by our values: Obsess over our Customers, Compete to Win, Challenge (Y)our Thinking, Act Fast / Learn Fast, and Stronger Together.
Social Impact: Make an impact in your community with our Social Impact Institute. We offer employees two paid volunteer days off annually and opportunities to get involved with pro-bono consulting projects and Environmental, Social, and Governance (ESG) initiatives.
