Senior AI/ML Engineer - Site Reliability Engineering

Royal Bank of Canada • Toronto, ON

2d • Onsite

About The Position

Join RBC's Site Reliability Engineering team as a founding member building the bank's first Agentic AI platform for software reliability and resiliency. You'll pioneer intelligent automation systems that autonomously prevent incidents, accelerate response times, and transform how we maintain resilience across enterprise systems. This is a rare opportunity to shape the future of AI-driven reliability at scale. Your innovations will protect millions of daily customer transactions and sign-ins. With a clear technical leadership trajectory, you'll architect cutting-edge solutions at the intersection of AI and infrastructure, setting the standard for autonomous operations in financial services.

Requirements

Strong ML engineering background with hands-on experience designing, training, and deploying machine learning models in production environments
Proven expertise in Agentic AI frameworks and tools (LangChain, LangGraph, AutoGen, CrewAI, or similar) and building autonomous, multi-agent systems
Deep understanding of Model Context Protocol (MCP) for enabling AI agents to interact with external systems and data sources
Experience building AI agents with tool-calling capabilities, memory management, and reasoning chains
Proficiency in Python and experience with ML libraries (scikit-learn, TensorFlow, PyTorch, or similar)
Working knowledge of containerization (Docker), orchestration (Kubernetes/OpenShift), and infrastructure-as-code tools (Ansible, Terraform)
Demonstrated ability to translate complex technical concepts into business value and collaborate effectively with cross-functional teams

Nice To Haves

Prior experience in Site Reliability Engineering, DevOps, or infrastructure monitoring roles
Familiarity with observability tools (Prometheus, Grafana, ELK stack) and incident management platforms (PagerDuty, ServiceNow)
Experience with LLMs, prompt engineering, and retrieval-augmented generation (RAG) architectures
Background in financial services or other highly regulated industries with strict reliability requirements

Responsibilities

Design and implement end-to-end Agentic AI solutions that autonomously detect anomalies, identify root causes, and resolve incidents with minimal human intervention
Develop intelligent automation frameworks using LangChain and LangGraph to create context-aware agents that learn from incident patterns and continuously improve response strategies
Build ML-powered monitoring and alerting systems that distinguish signal from noise, dramatically reducing false positives and improving MTTD (Mean Time to Detect) and MTTI (Mean Time to Identify)
Architect scalable, production-grade solutions on OpenShift and Kubernetes that process real-time system metrics and telemetry data at enterprise scale
Implement infrastructure-as-code with Ansible and containerize workloads with Docker to ensure reproducibility, consistency, and rapid deployment across environments
Partner with incident management and operations teams to translate operational pain points into AI-driven automation opportunities that measurably reduce toil
Establish and track KPIs focused on reducing MTTR (Mean Time to Resolve), MTTD, and MTTI while improving system reliability
Lead technical design discussions and contribute to architectural decisions that shape RBC's AI-powered reliability strategy

Benefits

Bonuses
Flexible benefits
Competitive compensation
Commissions
Stock where applicable
Leaders who support your development through coaching and management opportunities
Ability to make a difference and lasting impact
Work in a dynamic, collaborative, progressive, and high-performing team
A world-class training program in financial services
Flexible work/life balance options
Opportunities to do challenging work