Lead Platform Reliability Engineer, Global AI Platform & Solutions

Manulife•Toronto, ON

4d•$113,260 - $210,340•Hybrid

About The Position

The Lead Platform Reliability Engineer (PRE) ensures the stability, performance, and scalability of the shared platform that supports internal AI solution development. The role combines software engineering, SRE practices, and operations to keep the platform reliable and developer-friendly. It involves operating scalable backend services that support high-traffic agent interactions, retrieval operations, and real-time execution flows. The PRE will also maintain AI services runbooks, playbooks, and enablement for GOCC. Collaboration with global engineering, security, and AI governance teams is essential to ensure compliance with cross-geo regulations and Asia’s data residency requirements.

Requirements

Bachelor’s degree in Computer Science or Engineering, or equivalent experience (degree not strictly required if skills are demonstrated).
5-8 years of experience in DevOps, Platform Engineering, or Production Operations.
Proven track record operating large-scale distributed systems and running on-call.
Operational experience with cloud-native development: Azure, Kubernetes, containers, CI/CD, and observability stacks.
Knowledge of Python and/or Java, Scala, or TypeScript for building backend services and automation.
Understanding of AI solutions, LLM systems, retrieval architectures, embeddings, vector stores, prompt/tool orchestration, and agent workflow fundamentals.
Knowledge of API design, asynchronous workflows, concurrency, reliability engineering (SLOs, error budgets), and performance tuning.
Familiarity with security, governance, and compliance for AI/data systems (authN/authZ, data protection, audit logging, model governance).
Ability to collaborate across global teams and translate business requirements into platform capabilities and operational SLAs.

Nice To Haves

ITIL & ITSM certification
Azure Administrator/DevOps certificate
Kubernetes: CKA/CKS certificate
HashiCorp Terraform Associate certificate

Responsibilities

Define SLOs/SLIs, track operations budgets, reduce MTTR, plan capacity, and tune autoscaling.
Build and maintain logging, metrics, tracing, and alerting; instrument platform components; create runbooks and dashboards.
Serve on-call for platform incidents; triage, mitigate, and root-cause issues; drive postmortems and corrective actions.
Develop self-service capabilities, AIOps/MLOps/GitOps/CI/CD pipelines, and operational automations (provisioning, upgrades, backups).
Manage clusters, networks, storage, and policies via Terraform/Ansible; prevent configuration drift.
Enforce identity/RBAC, secrets management, supply chain security, and regulatory controls; collaborate with risk and audit.
Optimize resource usage, plan capacity, and control spend (rightsizing, autoscaling, reservations/spot).
Implement safe rollouts, progressive delivery, and policy-as-code guardrails.
Treat the platform as a product; define operational SLAs aligned with the product roadmap, service catalog, and developer experience.