Staff Platform Engineer, ML & Data

Mistplay • Montreal, QC

Hybrid

About The Position

Reporting to the VP of Data and Machine Learning Platform, the Staff Platform Engineer, ML & Data is responsible for defining and scaling the unified platform that powers the full data and ML lifecycle at Mistplay, from raw ingestion to real-time model serving. This role is not about modeling or analysis; it is about owning the data and ML platform as a single, cohesive system at scale. You will set technical direction across the entire lifecycle, own critical platform components end-to-end, and establish best practices that enable data and models to move from experimentation to real-time business impact with speed and trust. You will operate as a technical leader across teams, partnering with Data Science, Data Engineering, and Backend to reduce decision latency, increase throughput across both analytical and ML workloads, and improve data trust from ingestion through inference.

Requirements

End-to-End Platform Experience - 10+ years building and operating production data and ML platforms; proven ownership of systems spanning the full lifecycle from ingestion to real-time inference and decisioning.
Software Engineering - strong proficiency in Python, Scala, or Go; track record of building and evolving complex distributed systems with high reliability, maintainability, and strong engineering standards.
Data & ML Systems Depth - deep understanding of both data platform (warehousing, lakehouse, pipelines) and ML platform (feature stores, training, serving) architectures; ability to reason about trade-offs across latency, consistency, freshness, and throughput in a unified system.
Feature Platform & Serving - strong experience designing and operating feature stores with offline/online consistency; deep expertise in model serving architectures (real-time, batch, serverless) with progressive delivery (A/B, canary, shadow).
Streaming & Batch Pipelines - strong experience with streaming systems (e.g., Kafka, Flink) and batch frameworks (e.g., Spark, dbt); clear understanding of trade-offs across latency, throughput, and cost for both analytical and ML workloads.
Observability & Operations - strong operational rigor across data and ML systems (metrics, logs, traces, data quality); experience defining SLOs, capacity planning, cost optimization, and leading incident response across the full platform stack.
Technical Leadership (Staff Level) - sets technical direction across teams; mentors engineers; drives design reviews and architectural decisions spanning data and ML; balances short-term delivery with long-term platform health.
Collaboration & Influence - operates effectively across Data Science, Analytics, DevOps, and Backend; influences without authority; translates business needs into unified platform strategy and execution across both data and ML domains.

Responsibilities

Be the main driver and expert for designing, building, and operating the unified data and ML platform as a single system:
Ingestion & Pipeline Infrastructure - define scalable, reliable ingestion systems for batch and streaming data sources; drive standards for schema evolution, data contracts, and end-to-end lineage; optimize compute and cost strategy across diverse workloads.
Data Warehouse & Lakehouse Architecture - architect the core analytical data platform; define storage layer strategies, partitioning, and access patterns; establish standards for data modeling, performance, and cost efficiency that serve both analytical and ML consumers.
Transformation & Orchestration Layer - lead design of scalable, maintainable transformation systems (e.g., dbt, Spark); define orchestration standards and dependency management; enforce data quality contracts and testing frameworks that feed both downstream analytics and ML training.
Feature Platform - lead design of high-quality, reusable feature systems bridging the data and ML layers; enforce offline/online consistency contracts; drive standardization for feature definitions, ownership, and discoverability across both analytical and model consumers.
Training Infrastructure - define scalable, reproducible training and backfill systems; drive standards for dataset versioning, lineage, and reproducibility; optimize compute strategy across batch and distributed workloads.
Real-Time Inference & Serving - architect low-latency, highly available model serving systems; define deployment strategies (canary, shadow, A/B); establish patterns for autoscaling, traffic routing, and failure isolation at scale.
Observability & Reliability (End-to-End) - establish standards for data quality, feature drift, and model performance monitoring across the full lifecycle; define SLOs and operational practices; lead incident response frameworks and postmortem culture spanning data and ML systems.
Platform Tooling & Evolution - evaluate, integrate, and rationalize platform components across the data and ML stack (e.g., Kafka, dbt, Spark, MLflow, feature stores, serving systems); lead large-scale migrations and platform simplification with minimal disruption.
Technical Leadership & Strategy (Staff Scope) - set architecture direction across the unified platform; define long-term strategy and drive cross-team alignment; identify and prioritize highest-leverage investments tied to business outcomes.