Senior SRE Engineer – AI Platform

RBC, Calgary, AB
Onsite

About The Position

We're looking for an experienced Senior Site Reliability Engineer with subject-matter expertise in designing and implementing reliable, scalable AI service infrastructure and automation systems. This is an opportunity to grow in AI operations and work with a team committed to bringing enterprise-grade reliability to our production AI services. At RBC Borealis, you’ll join a team that works directly with leading machine learning researchers, has access to rich, massive datasets, and has the computational resources to support ongoing development in areas such as reinforcement learning, unsupervised learning, and computer vision.

Requirements

  • Strong and relevant experience designing and implementing distributed systems and reliability infrastructure for AI systems
  • Proven expertise in Site Reliability Engineering practices, including observability, alerting, and incident management
  • Experience building and maintaining CI/CD pipelines with tools such as Jenkins, GitHub Actions, or similar
  • In-depth knowledge of Kubernetes and OpenShift Container Platform (OCP4) or similar container orchestration platforms
  • Hands-on experience with observability and monitoring platforms such as Dynatrace, Datadog, or similar solutions
  • Experience implementing logging and tracing solutions for distributed systems using platforms like Elasticsearch or similar tools
  • Experience optimizing containerized workloads and managing cloud infrastructure across hybrid environments
  • Hands-on experience building and deploying hybrid environments that span on-premises infrastructure and major cloud providers such as AWS and Azure
  • Experience managing NoSQL databases such as MongoDB in production environments
  • Familiarity with machine learning model deployment, serving, and operational requirements
  • Experience or interest in exploring self-hosted machine learning model infrastructure and agentic workflow systems
  • Familiarity with programming languages such as Python, Bash or JavaScript

Responsibilities

  • Designing, building, and optimizing AI service reliability infrastructure and automation systems that operate the business's AI and ML applications
  • Designing and implementing best practices and standards for reliability, observability, and incident response across AI systems and ML pipelines
  • Collaborating with engineers and machine learning researchers to ensure continuous deployment, monitoring, and resilience of AI applications at scale
  • Supporting AI applications and projects with infrastructure design decisions, capacity planning, and comprehensive observability
  • Building highly scalable, resilient cloud and on-premises systems for hosting AI services using state-of-the-art technologies

Benefits

  • Bonuses
  • Flexible benefits
  • Competitive compensation
  • Commissions
  • Stock options