Lead Inference Platform Support Engineer - AI I

Thomson Reuters - Toronto, ON
Hybrid

About The Position

Thomson Reuters is seeking a Lead Inference Platform Engineer. This role is for someone with specialized experience in machine learning/deep learning domains such as model compression, hardware-aware model optimization, hardware accelerator architecture, GPU/ASIC architecture, machine learning compilers, high-performance computing, performance optimization, numerics, or SW/HW co-design. As a Lead Inference Platform Engineer, you will optimize LLMs and ML models for high-performance inference; deploy and scale inference workloads on GPUs across AWS, Azure, GCP, and internal Kubernetes clusters; integrate models into production-grade APIs supporting TR products and enterprise workflows; and partner with Platform, Cloud, Product, Data Science, Architecture, and Enterprise AI teams to bring new research models into production. The full scope of the role is listed under Responsibilities below.

Requirements

  • Strong understanding of ML/LLM fundamentals and inference optimization techniques such as quantization (see the sketch after this list)
  • Hands‑on experience with GPU programming (CUDA preferred), inference runtimes (TensorRT, ONNX Runtime), and deep learning frameworks (PyTorch/TensorFlow)
  • Proficiency in Python and at least one systems language (C++ strongly preferred for performance-critical inference paths)
  • Experience deploying AI workloads to AWS/GCP/Azure and Kubernetes
  • Familiarity with vector search systems (such as OpenSearch vector indexes) and retrieval-augmented generation pipelines
  • Knowledge of distributed systems, microservices, CI/CD, and cloud-native architecture
  • Experience with neural network architectures such as CNNs, transformers, and diffusion models, and their performance characteristics
  • Understanding of GPUs, multithreading, and/or other accelerators with vectorized instructions
  • Specialized experience in one or more of the following machine learning/deep learning domains: model compression, hardware-aware model optimization, hardware accelerator architecture, GPU/ASIC architecture, machine learning compilers, high-performance computing, performance optimization, numerics, and SW/HW co-design
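
As an illustration of the quantization and model-compression skills above, the following is a minimal sketch of post-training dynamic quantization in PyTorch. The toy model and tensor shapes are hypothetical stand-ins, not part of this role's actual stack.

    import torch
    import torch.nn as nn

    # Hypothetical stand-in for a real serving model.
    model = nn.Sequential(
        nn.Linear(1024, 1024),
        nn.ReLU(),
        nn.Linear(1024, 256),
    ).eval()

    # Post-training dynamic quantization: Linear weights are stored as
    # int8 and activations are quantized on the fly at inference time.
    quantized = torch.ao.quantization.quantize_dynamic(
        model, {nn.Linear}, dtype=torch.qint8
    )

    x = torch.randn(1, 1024)
    with torch.inference_mode():
        reference = model(x)
        approx = quantized(x)

    # Compression trades a little numerical precision for smaller
    # weights and lower memory bandwidth; outputs agree approximately.
    print(torch.max(torch.abs(reference - approx)).item())

In practice the quantized model would be benchmarked for latency and validated for accuracy against the full-precision baseline before rollout.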

Nice To Haves

  • 3+ years of production experience deploying ML/LLM models at scale
  • Experience managing GPU fleets or inference clusters across public cloud and container platforms
  • Experience supporting enterprise-grade AI workloads in regulated or compliance-heavy environments

Responsibilities

  • Optimize LLMs and ML models for high-performance inference using techniques such as quantization, pruning, distillation, and hardware-specific tuning
  • Deploy and scale inference workloads on GPUs across AWS, Azure, GCP, and internal Kubernetes clusters, ensuring predictable performance during peak business hours
  • Implement routing and failover strategies for OpenAI/Anthropic/Vertex AI traffic (see the failover sketch after this list)
  • Integrate models into production-grade APIs supporting TR products and enterprise workflows
  • Develop highly optimized environments and eliminate performance bottlenecks to reduce latency
  • Collaborate with Platform Engineering teams (Landing Zones, Network, Storage, Compute, AI) to ensure inference workloads align with TR’s cloud-native patterns (AWS, Azure, GCP, OCI)
  • Build and optimize containerized inference pipelines using Kubernetes for large‑scale distributed workloads
  • Ensure compliance with TR’s AI standards for deployment, monitoring, governance, and drift detection
  • Profile inference performance, identify GPU/CPU bottlenecks, and optimize compute utilization across heterogeneous hardware
  • Implement observability and health monitoring for inference pipelines, ensuring reliability of enterprise AI services
  • Collaborate with platform teams to enhance capacity forecasting for AI workloads
  • Work with Product, Data Science, Architecture, and Enterprise AI teams to onboard new research models into production
  • Collaborate closely with AI engineers to invent new quantization techniques, improve numerical precision, and explore non‑standard architectures
  • Partner with Cloud Engineers (Azure, AWS, GCP) to develop guardrails and automation that support inference workloads
  • Support the scale-out of AI infrastructure during critical releases and global product rollouts
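
To make the routing and failover responsibility concrete, here is a minimal sketch of priority-ordered failover across hosted model providers. It assumes each provider client (OpenAI, Anthropic, Vertex AI) is wrapped as a simple prompt-to-text callable; the function name, parameters, and retry policy are hypothetical.

    import random
    import time
    from typing import Callable, Sequence, Tuple

    # (name, callable) pairs; each callable wraps one provider's SDK.
    Provider = Tuple[str, Callable[[str], str]]

    def route_with_failover(
        prompt: str,
        providers: Sequence[Provider],
        retries_per_provider: int = 2,
        base_backoff_s: float = 0.5,
    ) -> str:
        """Try providers in priority order; retry transient failures
        with jittered exponential backoff, then fail over."""
        last_error = None
        for _name, call in providers:
            for attempt in range(retries_per_provider):
                try:
                    return call(prompt)
                except Exception as exc:  # real code catches provider-specific errors
                    last_error = exc
                    time.sleep(base_backoff_s * (2 ** attempt) * (1 + random.random()))
        raise RuntimeError(f"all providers failed; last error: {last_error}")

A production router would add health checks, per-provider rate limits, and circuit breakers on top of simple retries.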

Benefits

  • Flexible vacation
  • Two company-wide Mental Health Days off
  • Access to the Headspace app
  • Retirement savings
  • Tuition reimbursement
  • Employee incentive programs
  • Resources for mental, physical, and financial wellbeing