Senior Platform Reliability Engineer - Global AI Platform

Manulife • Toronto, ON

4d • $113,000 - $163,000 • Hybrid

About The Position

As a Senior Platform Reliability Engineer, you will monitor, analyze, and optimize software architecture and maintain the software environment to best support testing and deployment in a continuous integration/continuous delivery (CI/CD) environment. This role provides a reliable and scalable platform experience to Global AI Platform users. You will develop self-service capabilities, AIOps/MLOps/GitOps/CI/CD pipelines, and operational automations for provisioning, upgrades, and backups. You will manage clusters, networks, storage, and policies via Terraform/Ansible, preventing configuration drift. You will also enforce identity/RBAC, secrets management, supply chain security, and regulatory controls, collaborating with risk and audit teams. Optimizing resource usage, planning capacity, and controlling spending (rightsizing, autoscaling, reservations/spot) are key aspects of this role, as are safe rollouts, progressive delivery, and policy-as-code guardrails. You will resolve persistent platform issues surfaced by technical support teams, improve performance through automation, and push for greater platform reliability to support product development. You will deliver resilient and scalable applications, focusing on continuous delivery and operational insight. You are expected to collaborate with platform and software engineers, platform reliability engineers, Product Owners, and engineering leadership to uncover pain points and opportunities to accelerate the delivery of new value through software. You will investigate new platform solutions to enhance the service delivery experience and address incidents and problems, with rotational accountability for on-call support.

Requirements

Familiarity with agile and DevOps principles, test-driven development, continuous integration, and other approaches to accelerate the delivery of new features
Understanding of the software development lifecycle
Understanding of how technology supports Manulife's business strategy
Deep understanding of DevOps principles; prioritizes platform over products
Attends advanced training sessions and is certified in multiple domains of expertise
Demonstrates all core skills and good interpersonal skills for the role
Good working and background knowledge of area of practice
Uses and combines knowledge of the discipline and the market to formulate the right approach
Participates in functional demos utilizing new tech; designs own control structures
Sees actions partly in terms of longer-term goals
Understands the corporate climate & culture
Strong knowledge of the business
Experience with virtual infrastructure and CI/CD tools such as Jenkins, GitHub, and TeamCity
Experience in languages such as Python, Java, JavaScript, .NET, HTML5, CSS3, Swift and/or similar technologies
Understanding of systems monitoring tools and analytics (New Relic, Moogsoft, xMatters, etc.)
Experience with Cloud Foundry and other components supporting a highly-automated global engineering platform
Collaborative attitude, willingness to work with team members; able to coach, participate in code reviews, share skills and methods
Constantly learns from both success and failure
Experience with open-source technologies preferable
Good organizational and problem-solving abilities that enable you to manage through creative abrasion
Good verbal and written communication; able to effectively articulate technical vision, possibilities, and outcomes
Experiments with emerging technologies and understands how they will impact what comes next
Bachelor’s in Computer Science/Engineering or equivalent experience (not strictly required if skills demonstrated).
5–8+ years in DevOps/Platform Engineering or Production Operations (8+ preferred for senior level).
Proficiency in Python and/or Java/Scala/TypeScript for backend services and automation.
Hands-on experience with Azure, Kubernetes, containers, CI/CD, and observability stacks.
Strong understanding of LLM systems, retrieval architectures, embeddings, vector stores, prompt/tool orchestration, and agent workflow fundamentals.
Expertise in API design, asynchronous workflows, concurrency, reliability engineering concepts (SLOs, error budgets), and performance tuning.
Familiarity with security, governance, and compliance for AI/data systems (authN/authZ, data protection, audit logging, model governance).
Proven track record of operating large-scale distributed systems and participating in on-call rotations.
Ability to collaborate across global teams and translate business needs into platform capabilities and operational SLAs.

Responsibilities

Provides a reliable and scalable platform experience to Global AI Platform users
Monitors, analyzes, optimizes, and maintains the software environment to best support testing and deployment in a continuous integration/continuous delivery environment
Develop self-service capabilities, AIOps/MLOps/GitOps/CI/CD pipelines, and operational automations (provisioning, upgrades, backups).
Manage clusters, networks, storage, and policies via Terraform/Ansible; prevent configuration drift.
Enforce identity/RBAC, secrets management, supply chain security, and regulatory controls; collaborate with risk and audit.
Optimize resource usage, plan capacity, control spending (rightsizing, autoscaling, reservations/spot).
Implement safe rollouts, progressive delivery, and policy-as-code guardrails.
Resolves persistent platform issues when surfaced by technical support teams
Provides performance enhancements through automation and pushes for enhanced reliability of the platform to support product development
Delivers resilient and scalable applications, with a focus on continuous delivery and operational insight
Collaborates with platform and software engineers, platform reliability engineers, Product Owners, and engineering leadership to uncover pain points and opportunities to accelerate the delivery of new value through software
Investigates new platform solutions to enhance service delivery experience
Delivers a good user experience to other engineers, with a focus on self-service and continuous delivery
Addresses incidents and problems, with rotational accountability for on-call support