Sr. Machine Learning Ops Engineer

McKesson • Mississauga, ON

8d • CA$99,100 - CA$132,100 • Hybrid

About The Position

Join McKesson’s growing AI/ML team and play a critical role in operationalizing machine learning and Generative AI solutions at scale. This role focuses on deploying, standardizing, and maintaining production-ready ML and agentic AI systems—enabling consistent, reliable, and optimized delivery of data science innovations that support McKesson’s AIM28 strategic initiatives.

Requirements

Strong experience deploying ML models into production environments
Hands-on expertise with CI/CD pipelines, monitoring, and production ML systems
Experience with GenAI or agentic AI frameworks (LangChain, Semantic Kernel, etc.)
Knowledge of model observability, drift detection, and operational support
Experience working in scaling or early-stage ML environments
Proficiency with cloud platforms (AWS, Azure, or GCP)
Strong cross-functional collaboration skills (Data Science, Product, Architecture)
Ability to drive standardization, automation, and platform maturity
Focus on reliability, scalability, and optimization
Degree or equivalent, typically with 7+ years of relevant experience

Nice To Haves

Experience with the Databricks ecosystem (e.g., Databricks Genie)
Familiarity with LangChain, LangGraph, or Microsoft Semantic Kernel
Exposure to GenAI cost optimization / FinOps practices
Experience implementing secure enterprise applications (e.g., Okta)
Experience in healthcare or regulated environments
Experience scaling ML/AI capabilities from experimentation to production maturity

Responsibilities

Lead deployment and operationalization of ML models and GenAI/agentic solutions, ensuring scalability, reliability, and performance
Partner with Data Scientists to identify and automate high-impact model use cases, building end-to-end pipelines (CI/CD, monitoring, alerting)
Define and enforce standardized deployment patterns and runbooks across teams
Own KTLO (keep-the-lights-on) operations for ML and GenAI systems, including health monitoring, logging, and performance tracking
Design and implement pipelines for batch, real-time, and event-driven inference
Establish observability frameworks (monitoring, logging, lineage, alerting)
Enable deployment of agentic AI solutions using tools such as LangChain, LangGraph, Semantic Kernel, and Databricks
Ensure secure deployment of applications with proper access controls (e.g., Okta integration)
Drive cost and performance optimization across ML and GenAI workloads
Partner with architecture, compliance, governance, and legal teams to meet enterprise standards
Conduct ongoing research into emerging tools and technologies to improve deployment practices
Guide and influence architectural decisions while maintaining clear separation between platform and deployment ownership