Machine Learning Platform Engineer

RBC•Toronto, ON

1d•Onsite

About The Position

We’re looking for an experienced Machine Learning Platform Engineer who will bring focus and subject-matter expertise around designing and implementing machine learning infrastructure and automation tools (MLOps and DevOps). This is a unique opportunity to grow in the world of machine learning infrastructure and work with a team of passionate individuals committed to the mission of bringing ML to enterprise. At RBC Borealis, you’ll be joining a team that works directly with leading researchers in machine learning, has access to rich and massive datasets, and offers the computational resources to support ongoing development in areas such as reinforcement learning, unsupervised learning and computer vision.

Requirements

Strong experience designing and operating distributed/ML systems plus deep Kubernetes/OpenShift knowledge (Helm, operators, custom resources, RBAC, troubleshooting)
Proven history building DevOps/CI/CD pipelines (GitHub Actions), multi-stage Docker images, registry mirroring, and infrastructure automation in restricted enterprise environments
In-depth knowledge of various stages of the machine learning application deployment process
Proficiency with programming languages such as Python, Bash, or Rust
Solid grasp of software engineering best practices (unit and integration testing, coding standards, code reviews, source control) and experience implementing production monitoring and alerting
Hands-on experience building and deploying hybrid and on-premises enterprise environments
Familiarity with Large Language Model (LLM) inference and serving frameworks such as vLLM or similar

Responsibilities

Deploying and operating the GenAI platform across OpenShift/Kubernetes
Managing large language model deployments (Cohere Command, Llama, Mistral) on GPU infrastructure (NVIDIA A100/H100), and configuring RAG pipelines with serving frameworks like vLLM, NVIDIA NIM, and TensorRT-LLM
Monitoring GPU utilization, model performance metrics, and resource allocation across the platform
Implementing observability stacks—Prometheus, Grafana, Pushgateway, and structured logging pipelines—to surface platform health, performance, and security signals
Designing and implementing best practices and standards for data and machine learning pipelines across the organization
Supporting platform users and cross-functional teams through infrastructure design guidance, thorough documentation, and collaboration across multiple RBC locations
Building highly scalable, resilient on-premises systems for hosting machine learning systems using state-of-the-art technologies