Senior Machine Learning Engineer - Power Factors

Opportunities with AppDirect's Advisor Partners (Recruitment as a Service) • Brossard, QC

About The Position

Power Factors is seeking a Senior Machine Learning Engineer to join their Innovation team. This role will focus on ambitious technical initiatives, specifically building and fine-tuning Large Language Models (LLMs) using Power Factors' unique dataset. The engineer will be involved in all stages of the process, from data preparation and architecture design to aligning with business value, developing curriculum and tokenization strategies, and scaling models from proof-of-concept to production.

The role involves significant architectural decision-making for foundational time-series models, including choices related to model heads, attention mechanisms, context/horizon sizing, and handling multi-frequency data. The engineer will also design the tokenization strategy for a multi-modal training corpus, estimate scaling laws for fleet-scale capacity, and stay abreast of the latest time-series foundation model literature. A key responsibility is designing an experimental framework to validate the business value of these models for target use cases.

On the training infrastructure side, the engineer will build and maintain a reproducible training environment with experiment tracking. They will optimize distributed training pipelines for efficiency and implement a model registry for artifact management. The pre-training recipe, including learning rate schedules and curriculum strategies, will also be owned by this role.

For fleet-scale execution, the engineer will scale training from pilot to a larger dataset, handle real-world data quality issues, and report baseline metrics. Collaboration with Backend/Data Engineers and the Tech Lead/Product team is crucial for ensuring data quality, pipeline integration, and meeting customer requirements. Documentation of training procedures and architecture decisions is also expected.

Requirements

5+ years of ML engineering experience, with meaningful time in foundation model or large-scale model development.
Deep expertise in time-series modelling — multivariate, multi-frequency, and heterogeneous sensor data.
Proven ability to design and train transformer-based or sequence model architectures from scratch.
Distributed training engineering: GPU cluster config, mixed-precision training, gradient accumulation, checkpointing, and fault recovery.
Tokenization and representation design for continuous time-series data: quantization, patching, and event/metadata interleaving.
Strong Python and PyTorch (or JAX) skills; proficiency with the HuggingFace ecosystem.
MLOps fluency: experiment tracking, model registry design, reproducible pipelines, and automated retraining.
Excellent written and verbal English communication skills.

Nice To Haves

Familiarity with published time-series foundation model approaches (Chronos, Moirai, TimesFM, or similar) — a significant advantage.
Experience with uncertainty quantification in forecasting: Gaussian, mixture, or quantile output heads.
Background in scaling-law estimation for model capacity planning.
Exposure to multi-modal training corpora combining continuous signals, discrete operational events, and structured metadata.
Renewable energy, SCADA, or industrial IoT data experience — including an understanding of signal quality issues (sparsity, flatlines, sensor drift) in real-world deployments.
Experience evaluating and selecting attention variants and context/horizon sizing for long-sequence tasks.
Published research, open-source contributions, or patents in time-series modelling or foundation models.
Knowledge of curriculum learning and masking strategies for pre-training.

Responsibilities

Own architecture decisions for the Power Factors (PF) foundational time-series model: head choice, attention variant, context/horizon sizing, and multi-frequency handling, grounded in empirical evidence from the pilot dataset.
Design the tokenization strategy for PF's multi-modal training corpus: quantization scheme, multi-frequency handling, and event/metadata interleaving.
Establish scaling-law estimates on the pilot dataset to project fleet-scale capacity requirements.
Track time-series foundation model literature and translate relevant findings into the PF training context.
Design an experimental framework to validate business value for target use cases.
Build and maintain a containerized, reproducible training environment with full experiment tracking and baseline comparisons.
Optimize the distributed training pipeline: throughput, memory layout, gradient accumulation, checkpointing, and fault recovery.
Design and implement the model registry, linking config, metrics, dataset version, and code SHA for every artifact.
Own the pre-training recipe: learning rate schedules, masking/curriculum strategies, and validation protocols.
Scale training from pilot to the full ~1,000-site universe, including per-asset-class and per-OEM normalization.
Handle real-world data quality issues: outliers, flatlines, missing sensors, and irregular sampling.
Report baseline metrics per asset class, OEM, and capacity bucket; iterate based on shadow validation.
Partner with the Backend/Data Engineer on data quality standards, feature store design, and pipeline interfaces.
Collaborate with the Tech Lead and Product team to ensure model outputs meet pilot customer requirements.
Document training runbooks, debugging procedures, and architecture decisions to enable team-wide operability.

Benefits

Comprehensive benefits package including health, dental, and vision coverage, plus dedicated wellness support
Generous paid vacation policy
Employer RRSP matching program
Work-from-abroad opportunities with manager approval
Exposure to a global team operating across multiple countries and time zones