Senior ML Performance Engineer

Lemurian Labs, Toronto, ON

About The Position

We're looking for a Senior ML Performance Engineer to architect and lead our Performance Testing Platform from the ground up. You'll be the technical authority on how we measure, validate, and optimize the performance of large language models — including Llama 3.2 70B, DeepSeek, and others — before and after compiler optimization on modern GPU architectures. This is a high-impact role at the intersection of ML systems, GPU architecture, and performance engineering. You'll build the infrastructure that proves our compiler delivers real, measurable value — and you'll work directly with compiler and ML engineers to drive the optimizations that get us there.

Requirements

  • BS degree in computer science, computer engineering, electrical engineering, or equivalent practical experience
  • 7+ years of experience in performance engineering, benchmarking, or systems engineering roles
  • Deep understanding of ML inference workloads, particularly transformer-based models and LLMs
  • Hands-on experience with GPU programming and optimization (CUDA, ROCm, or similar)
  • Strong programming skills in Python and C/C++
  • Proven track record of building performance testing infrastructure or benchmarking platforms from scratch
  • Experience with ML frameworks (PyTorch, TensorFlow, ONNX Runtime, vLLM, TensorRT-LLM, etc.)
  • Proficiency with profiling and debugging tools for GPU workloads
  • Strong analytical skills with the ability to design experiments, analyze results, and communicate findings clearly
  • Experience with CI/CD systems and test automation frameworks

Nice To Haves

  • Master's or PhD in computer science, computer engineering, electrical engineering, or a related field
  • Experience with AMD GPUs (Instinct MI200/MI300 series) and the ROCm ecosystem
  • Knowledge of compiler optimization techniques and their impact on performance
  • Experience with distributed inference and multi-GPU workloads
  • Familiarity with ML model quantization, pruning, and other optimization techniques
  • Background in high-performance computing or systems-level optimization
  • Experience with infrastructure-as-code (Kubernetes, Docker, Terraform)
  • Contributions to open-source ML or systems projects

Responsibilities

  • Design and build a comprehensive performance testing platform for evaluating LLM inference workloads across GPU clusters
  • Define and implement the benchmarking methodology, metrics, and test suites that measure latency, throughput, memory utilization, power consumption, and model accuracy
  • Establish baseline performance for unoptimized models (Llama 3.2 70B, DeepSeek, etc.) and validate post-optimization improvements
  • Develop automated testing pipelines for continuous performance validation across compiler releases and model updates
  • Investigate performance bottlenecks using profiling tools (ROCm profilers, GPU traces, system-level monitoring) and work with the compiler team to drive optimizations
  • Create dashboards and reporting that provide clear visibility into performance trends, regressions, and wins
  • Collaborate cross-functionally with compiler engineers, ML engineers, and DevOps to ensure performance testing is integrated into our development workflow
  • Document best practices for performance testing and optimization of ML workloads on GPU hardware

Benefits

  • Competitive compensation including equity and company bonus opportunities
  • Medical, dental, and vision coverage, a retirement savings plan, and supplemental wellness benefits