ML Kernel Performance Engineer, Edge AI and Science

Amazon, Vancouver, BC
CA$114,800 - CA$191,800, Onsite

About The Position

Amazon Devices is seeking an ML Kernel Performance Engineer to work at the hardware-software boundary of its advanced compression platform and custom neural accelerator silicon. The role focuses on writing high-performance CUDA and Triton kernels that optimize neural network compression algorithms for training, fine-tuning, and inference. The engineer will also build tooling and kernel libraries that make GPU performance optimization broadly accessible, so that scientists and engineers can profile and diagnose kernel bottlenecks without deep CUDA expertise. The work ensures that novel quantization schemes and sparse computation patterns translate into real throughput gains on GPU hardware, directly accelerating training runs and enabling the deployment of compressed models to edge devices and cloud inference.

Requirements

  • 3+ years of non-internship professional software development experience
  • 2+ years of non-internship experience in the design or architecture (design patterns, reliability, and scaling) of new and existing systems
  • Experience with CUDA kernels or ML/low-level kernels, or experience in developing and deploying LLMs in production on GPUs, Neuron, TPU or other AI acceleration hardware
  • Experience with programming languages such as Python, Java, or C++

Nice To Haves

  • Bachelor's degree in computer science or equivalent
  • 3+ years of experience across the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
  • Experience with GPU kernel optimization and GPGPU computing (CUDA, Triton, SYCL, or ROCm)
  • Proficiency in low-level performance optimization for GPUs
  • Understanding of GPU memory hierarchies and optimization strategies (shared memory, L1/L2 cache, register pressure, memory coalescing)
  • Experience developing high-performance libraries for ML or HPC applications
  • Knowledge of ML frameworks (PyTorch, TensorFlow) and their GPU backends
  • Experience implementing custom PyTorch operators (torch.autograd.Function, C++ extensions)
  • Experience with parallel programming and optimization techniques
  • Background in neural network compression (quantization, pruning, knowledge distillation, low-rank factorization)
  • Knowledge of mixed-precision training and inference (FP16, BF16, FP8, INT8, INT4)
  • Experience with inference optimization (TensorRT, ONNX Runtime, vLLM, or similar)
  • Familiarity with Transformer architectures, attention mechanisms, and their compute/memory profiles
  • Experience with AWS Trainium/Inferentia or the Neuron Kernel Interface (NKI)
  • Experience with edge deployment, model compilation, or hardware-aware optimization

Responsibilities

  • Design and implement high-performance CUDA and Triton kernels for quantization-aware training, sparse matrix operations, and low-bit inference on modern GPU accelerators.
  • Analyze and optimize kernel-level performance for compression training workloads, conducting detailed performance analysis using profiling tools to identify and resolve bottlenecks.
  • Implement kernel-level optimizations such as operator fusion, tiling, memory access pattern optimization, and scheduling for compression-specific compute patterns.
  • Build a kernel development harness that enables any team member to profile kernel performance, test forward/backward accuracy, and validate at production scale.
  • Maintain and extend the team's training kernels library with clean interfaces, CI, and examples.
  • Collaborate closely with Applied Scientists, compiler engineers, and hardware architects to co-design ML-centric solutions.
  • Develop inference kernels for cloud deployment.
  • Build and maintain performance regression tests and benchmarking infrastructure that track kernel efficiency as models scale.

Benefits

  • Health insurance (medical, dental, vision, prescription, basic life & AD&D insurance)
  • Registered Retirement Savings Plan (RRSP)
  • Deferred Profit Sharing Plan (DPSP)
  • Paid time off
  • Other resources to improve health and well-being