Senior Platform Reliability Engineer - Global AI Platform

Manulife | Toronto, ON
$113,000 - $163,000 | Hybrid

About The Position

As a Senior Platform Reliability Engineer, you will monitor, analyze, and optimize software architecture and maintain the software environment to best support testing and deployment in a continuous integration/continuous delivery (CI/CD) environment. This role provides a reliable and scalable platform experience to Global AI Platform users.

You will develop self-service capabilities, AIOps/MLOps/GitOps/CI/CD pipelines, and operational automations for provisioning, upgrades, and backups. You will manage clusters, networks, storage, and policies via Terraform/Ansible, preventing configuration drift. You will also enforce identity/RBAC, secrets management, supply chain security, and regulatory controls in collaboration with risk and audit teams. Optimizing resource usage, planning capacity, and controlling spending (rightsizing, autoscaling, reservations/spot) are key aspects of this role, as are safe rollouts, progressive delivery, and policy-as-code guardrails.

This position resolves persistent platform issues surfaced by technical support teams, delivers performance enhancements through automation, and pushes for greater platform reliability in support of product development. You will deliver resilient and scalable applications, focusing on continuous delivery and operational insight. You are expected to collaborate with platform and software engineers, platform reliability engineers, Product Owners, and engineering leadership to uncover pain points and opportunities to accelerate the delivery of new value through software. You will investigate new platform solutions to enhance the service delivery experience and address incidents and problems, with rotational accountability for on-call support.

Requirements

  • Familiarity with agile and DevOps principles, test-driven development, continuous integration, and other approaches to accelerate the delivery of new features
  • Understanding of the software development lifecycle
  • Understanding of how technology supports Manulife's business strategy
  • Deep understanding of DevOps principles; prioritizes platform over products
  • Attends advanced training sessions and is certified in multiple domains of expertise
  • Demonstrates all core skills and good interpersonal skills for the role
  • Good working and background knowledge of the area of practice
  • Uses and combines knowledge of the discipline and the market to formulate the right approach
  • Participates in functional demos utilizing new tech; designs own control structures
  • Sees actions partly in terms of longer-term goals
  • Understands the corporate climate & culture
  • Strong knowledge of the business
  • Experience with virtual infrastructure and CI/CD tools such as Jenkins, GitHub, and TeamCity
  • Experience in languages such as Python, Java, JavaScript, .NET, HTML5, CSS3, Swift and/or similar technologies
  • Understanding of systems monitoring tools and analytics (New Relic, Moogsoft, xMatters, etc.)
  • Experience with Cloud Foundry and other components supporting a highly automated global engineering platform
  • Collaborative attitude, willingness to work with team members; able to coach, participate in code reviews, share skills and methods
  • Constantly learns from both success and failure
  • Experience with open-source technologies preferred
  • Good organizational and problem-solving abilities that enable you to manage through creative abrasion
  • Good verbal and written communication; able to effectively articulate technical vision, possibilities, and outcomes
  • Experiments with emerging technologies and understands how they will impact what comes next.
  • Bachelor’s in Computer Science/Engineering or equivalent experience (not strictly required if skills demonstrated).
  • 5–8+ years in DevOps/Platform Engineering or Production Operations (8+ preferred for senior level).
  • Proficiency in Python and/or Java/Scala/TypeScript for backend services and automation.
  • Hands-on experience with Azure, Kubernetes, containers, CI/CD, and observability stacks.
  • Strong understanding of LLM systems, retrieval architectures, embeddings, vector stores, prompt/tool orchestration, and agent workflow fundamentals.
  • Expertise in API design, asynchronous workflows, concurrency, reliability engineering concepts (SLOs, error budgets), and performance tuning.
  • Familiarity with security, governance, and compliance for AI/data systems (authN/authZ, data protection, audit logging, model governance).
  • Proven track record operating large-scale distributed systems and participating in on-call rotations.
  • Ability to collaborate across global teams and translate business needs into platform capabilities and operational SLAs.

Responsibilities

  • Provides a reliable and scalable platform experience to Global AI Platform users
  • Monitors, analyzes, optimizes, and maintains the software environment to best support testing and deployment in a continuous integration/continuous delivery environment.
  • Develop self-service capabilities, AIOps/MLOps/GitOps/CI/CD pipelines, and operational automations (provisioning, upgrades, backups).
  • Manage clusters, networks, storage, and policies via Terraform/Ansible; prevent configuration drift.
  • Enforce identity/RBAC, secrets management, supply chain security, and regulatory controls; collaborate with risk and audit.
  • Optimize resource usage, plan capacity, control spending (rightsizing, autoscaling, reservations/spot).
  • Implement safe rollouts, progressive delivery, and policy-as-code guardrails.
  • Resolves persistent platform issues when surfaced by technical support teams
  • Provides performance enhancements through automation and pushes for enhanced reliability of the platform to support product development
  • Delivers resilient and scalable applications, with a focus on continuous delivery and operational insight
  • Collaborates with platform and software engineers, platform reliability engineers, Product Owners, and engineering leadership to uncover pain points and opportunities to accelerate the delivery of new value through software
  • Investigates new platform solutions to enhance service delivery experience
  • Delivers a good user experience to other engineers, with a focus on self-service and continuous delivery
  • Addresses incidents and problems, with rotational accountability for on-call support

Benefits

  • health
  • dental
  • mental health
  • vision
  • short- and long-term disability
  • life and AD&D insurance coverage
  • adoption/surrogacy and wellness benefits
  • employee/family assistance plans
  • pension
  • global share ownership plan with employer matching contributions
  • financial education and counseling resources
  • holidays
  • vacation
  • personal days
  • sick days