Director, Data Science, AI Evaluations Platform

Royal Bank of Canada•Toronto, ON

Onsite

About The Position

RBC’s AI Group is building trusted AI capabilities for the enterprise, and evaluation is one of the core controls that makes this possible. As Director, Data Science, AI Evaluations Platform, you will lead the data science function responsible for how RBC measures model and agent quality, safety, risk, and performance. You will hire, develop, and mentor a high-performing team that owns evaluation datasets, sourcing methods, LLM-as-judge design, deterministic scorers, human evaluation protocols, quality benchmarks, and the measurement frameworks that help AI systems move from experimentation to production with evidence and control.

Requirements

8+ years of experience in data science, applied machine learning, AI evaluation, ML quality, or a related technical field, including experience leading and developing technical teams.
Strong experience designing evaluation frameworks for ML, generative AI, or agentic AI systems, including metrics, datasets, benchmarks, rubrics, and quality measurement.
Practical experience with LLM evaluation methods such as LLM-as-judge, deterministic scoring, human evaluation, hallucination assessment, factuality assessment, safety evaluation, or model quality benchmarking.
Strong technical foundation in data science, statistics, machine learning, experimentation, data curation, Python, SQL, and modern AI/ML development practices.
Proven ability to translate governance, model risk, responsible AI, and business requirements into measurable controls, repeatable evaluation processes, and decision-ready evidence.
Strong communication and stakeholder management skills, with the ability to influence senior leaders across research, engineering, product, governance, risk, and business teams.

Nice To Haves

Experience evaluating agentic AI systems, tool-calling workflows, multi-step reasoning, runtime traces, trajectory scoring, or workflow-level performance.
Experience in financial services, regulated AI, model risk management, responsible AI, enterprise governance, or audit-ready evidence processes.
Familiarity with tools and platforms such as MLflow, Langfuse, LangSmith, OpenTelemetry, Grafana, CI/CD pipelines, or comparable evaluation and observability tooling.
Publications, patents, open-source contributions, or industry work related to AI evaluation, ML quality, AI safety, applied research, or responsible AI.

Responsibilities

Lead and grow a high-performing data science team focused on model and agent evaluations, including evaluation datasets, rubrics, LLM-as-judge methods, deterministic scorers, human evaluation, and measurement frameworks.
Define evaluation science standards that translate model risk, responsible AI, product quality, safety, and business expectations into measurable criteria, repeatable methods, and clear evidence.
Own the end-to-end lifecycle for evaluation datasets and scorecards, including sourcing, curation, validation, quality checks, versioning, lineage, reuse, and ongoing improvement.
Design scalable evaluation approaches for generative AI and agentic systems, including task-level, workflow-level, trajectory-level, and runtime evaluation methods.
Partner with AI research, platform engineering, product, risk, governance, and business teams to embed evaluations into build, release, certification, monitoring, and recertification workflows.
Establish human evaluation and review protocols that produce reliable labels, reviewer guidance, adjudication processes, quality controls, and audit-ready evidence.
Measure and improve scorer accuracy, calibration, robustness, failure-mode coverage, and explainability across automated and human evaluation approaches.
Provide clear technical leadership, executive-ready communication, and coaching to help RBC scale trusted AI with speed, rigor, and control.