Machine Learning Platform Engineer

RBC
Toronto, ON
Onsite

About The Position

We’re looking for an experienced Machine Learning Platform Engineer who will bring focus and subject-matter expertise in designing and implementing machine learning infrastructure and automation tools (MLOps and DevOps). This is a unique opportunity to grow in the world of machine learning infrastructure and work with a team of passionate individuals committed to the mission of bringing ML to the enterprise. At RBC Borealis, you’ll be joining a team that works directly with leading researchers in machine learning, has access to rich and massive datasets, and offers the computational resources to support ongoing development in areas such as reinforcement learning, unsupervised learning and computer vision.

Requirements

  • Strong experience designing and operating distributed/ML systems plus deep Kubernetes/OpenShift knowledge (Helm, operators, custom resources, RBAC, troubleshooting)
  • Proven history building DevOps/CI/CD pipelines (GitHub Actions), multi-stage Docker images, registry mirroring, and infrastructure automation in restricted enterprise environments
  • In-depth knowledge of various stages of the machine learning application deployment process
  • Proficiency with programming languages such as Python, Bash, or Rust
  • Solid grasp of software engineering best practices (unit and integration testing, coding standards, code reviews, source control) and experience implementing production monitoring and alerting
  • Hands-on experience building and deploying hybrid and on-premises enterprise environments
  • Familiarity with Large Language Model (LLM) inference and serving frameworks such as vLLM or similar

Responsibilities

  • Deploying and operating the GenAI platform across OpenShift/Kubernetes
  • Managing large language model deployments (Cohere Command, Llama, Mistral) on GPU infrastructure (NVIDIA A100/H100), and configuring RAG pipelines with serving frameworks like vLLM, NVIDIA NIM, and TensorRT-LLM
  • Monitoring GPU utilization, model performance metrics, and resource allocation across the platform
  • Implementing observability stacks—Prometheus, Grafana, Pushgateway, and structured logging pipelines—to surface platform health, performance, and security signals
  • Designing and implementing best practices and standards for data and machine learning pipelines across the organization
  • Supporting platform users and cross-functional teams through infrastructure design guidance, thorough documentation, and collaboration across multiple RBC locations
  • Building highly scalable, resilient on-premises systems for hosting machine learning workloads using state-of-the-art technologies

Benefits

  • Bonuses
  • Flexible benefits
  • Competitive compensation
  • Commissions
  • Stock options