Senior SRE Engineer – AI Platform

RBC • Calgary, AB

8d • Onsite

About The Position

We're looking for an experienced Senior Site Reliability Engineer with subject-matter expertise in designing and implementing reliable, scalable AI service infrastructure and automation systems. This is a unique opportunity to grow in the world of AI operations and work with a team of passionate individuals committed to bringing enterprise-grade reliability to our production AI services. At RBC Borealis, you’ll be joining a team that works directly with leading researchers in machine learning, has access to rich and massive datasets, and offers the computational resources to support ongoing development in areas such as reinforcement learning, unsupervised learning, and computer vision.

Requirements

Strong and relevant experience designing and implementing distributed systems and reliability infrastructure for AI systems
Proven expertise in Site Reliability Engineering practices, including observability, alerting, and incident management
Experience building and maintaining CI/CD pipelines with tools such as Jenkins, GitHub Actions, or similar
In-depth knowledge of Kubernetes and OpenShift Container Platform (OCP4) or similar container orchestration platforms
Hands-on experience with observability and monitoring platforms such as Dynatrace, Datadog, or similar solutions
Experience implementing logging and tracing solutions for distributed systems using platforms like Elasticsearch or similar tools
Experience optimizing containerized workloads and managing cloud infrastructure across hybrid environments
Hands-on experience building and deploying hybrid environments that span on-premises infrastructure and major cloud providers such as AWS and Azure
Experience managing NoSQL databases such as MongoDB in production environments
Familiarity with machine learning model deployment, serving, and operational requirements
Experience or interest in exploring self-hosted machine learning model infrastructure and agentic workflow systems
Familiarity with programming languages such as Python, Bash or JavaScript

Responsibilities

Designing, building, and optimizing AI service reliability infrastructure and automation systems that operate the business's AI and ML applications
Designing and implementing best practices and standards for reliability, observability, and incident response across AI systems and ML pipelines
Collaborating with engineers and machine learning researchers to ensure continuous deployment, monitoring, and resilience of AI applications at scale
Supporting AI applications and projects with infrastructure design decisions, capacity planning, and comprehensive observability
Building highly scalable, resilient cloud and on-premises systems for hosting AI services using state-of-the-art technologies