ML Kernel Performance Engineer, Edge AI and Science

Amazon, Sunnyvale, CA
$165,200 - $223,600, Onsite

About The Position

Amazon Devices is seeking an ML Kernel Performance Engineer to work at the hardware-software boundary of its advanced compression platform and custom neural accelerator silicon. The role focuses on writing high-performance CUDA and Triton kernels that optimize machine learning model compression algorithms for training, fine-tuning, and inference. The engineer will build tooling and kernel libraries that make GPU performance optimization accessible, so that scientists and engineers can profile and resolve kernel bottlenecks without deep CUDA expertise. The work ensures that novel quantization schemes and sparse computation patterns translate into real throughput gains on GPU hardware, directly accelerating training runs and enabling the deployment of compressed models to edge devices and cloud inference.

Requirements

  • 3+ years of non-internship professional software development experience
  • 2+ years of non-internship experience in the design or architecture (design patterns, reliability, and scaling) of new and existing systems
  • Knowledge of Python and/or C++ programming
  • Experience with CUDA kernels or ML/low-level kernels, or experience in developing and deploying LLMs in production on GPUs, Neuron, TPU or other AI acceleration hardware

Nice To Haves

  • Bachelor's degree in computer science or equivalent
  • 3+ years of experience across the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
  • Experience with GPU kernel optimization and GPGPU computing (CUDA, Triton, SYCL, or ROCm)
  • Proficiency in low-level performance optimization for GPUs
  • Understanding of GPU memory hierarchies and optimization strategies (shared memory, L1/L2 cache, register pressure, memory coalescing)
  • Experience developing high-performance libraries for ML or HPC applications
  • Knowledge of ML frameworks (PyTorch, TensorFlow) and their GPU backends
  • Experience implementing custom PyTorch operators (torch.autograd.Function, C++ extensions)
  • Experience with parallel programming and optimization techniques
  • Background in neural network compression (quantization, pruning, knowledge distillation, low-rank factorization)
  • Knowledge of mixed-precision training and inference (FP16, BF16, FP8, INT8, INT4)
  • Experience with inference optimization (TensorRT, ONNX Runtime, vLLM, or similar)
  • Familiarity with Transformer architectures, attention mechanisms, and their compute/memory profiles
  • Experience with AWS Trainium/Inferentia or the Neuron Kernel Interface (NKI)
  • Experience with edge deployment, model compilation, or hardware-aware optimization

Responsibilities

  • Design and implement high-performance CUDA and Triton kernels for quantization-aware training, sparse matrix operations, and low-bit inference on modern GPU accelerators.
  • Analyze and optimize kernel-level performance for compression training workloads, conducting detailed performance analysis using profiling tools to identify and resolve bottlenecks.
  • Implement kernel-level optimizations such as operator fusion, tiling, memory access pattern optimization, and scheduling for compression-specific compute patterns.
  • Build a kernel development harness that enables any team member to profile kernel performance, test forward/backward accuracy, and validate at production scale.
  • Maintain and extend the team's training kernels library with clean interfaces, CI, and examples.
  • Collaborate closely with Applied Scientists, compiler engineers, and hardware architects to co-design ML-centric solutions.
  • Develop inference kernels for cloud deployment.
  • Build and maintain performance regression tests and benchmarking infrastructure.

Benefits

  • health insurance and related benefits (medical, dental, vision, prescription, Basic Life & AD&D insurance with an option for Supplemental Life plans, EAP, Mental Health Support, Medical Advice Line, Flexible Spending Accounts, Adoption and Surrogacy Reimbursement coverage)
  • 401(k) matching
  • paid time off
  • parental leave
  • sign-on payments
  • restricted stock units (RSUs)