ML Kernel Performance Engineer, Edge AI and Science

Amazon•Sunnyvale, CA

2d • $165,200 - $223,600 • Onsite

About The Position

Amazon Devices is seeking an ML Kernel Performance Engineer to work at the hardware-software boundary of its advanced compression platform and custom neural accelerator silicon. This role focuses on crafting high-performance CUDA and Triton kernels that optimize machine learning model compression algorithms for training, fine-tuning, and inference. The engineer will build tooling and kernel libraries that democratize GPU performance optimization, enabling scientists and engineers to profile and resolve kernel bottlenecks without deep CUDA expertise. The work involves ensuring that novel quantization schemes and sparse computation patterns translate into real throughput gains on GPU hardware, directly accelerating training runs and enabling the deployment of compressed models to edge devices and cloud inference.

Requirements

3+ years of non-internship professional software development experience
2+ years of non-internship experience designing or architecting new and existing systems (design patterns, reliability, and scaling)
Knowledge of Python and/or C++ programming
Experience with CUDA kernels or ML/low-level kernels, or experience in developing and deploying LLMs in production on GPUs, Neuron, TPU or other AI acceleration hardware

Nice To Haves

Bachelor's degree in computer science or equivalent
3+ years of experience across the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
Experience with GPU kernel optimization and GPGPU computing (CUDA, Triton, SYCL, or ROCm)
Proficiency in low-level performance optimization for GPUs
Understanding of GPU memory hierarchies and optimization strategies (shared memory, L1/L2 cache, register pressure, memory coalescing)
Experience developing high-performance libraries for ML or HPC applications
Knowledge of ML frameworks (PyTorch, TensorFlow) and their GPU backends
Experience implementing custom PyTorch operators (torch.autograd.Function, C++ extensions)
Experience with parallel programming and optimization techniques
Background in neural network compression (quantization, pruning, knowledge distillation, low-rank factorization)
Knowledge of mixed-precision training and inference (FP16, BF16, FP8, INT8, INT4)
Experience with inference optimization (TensorRT, ONNX Runtime, vLLM, or similar)
Familiarity with Transformer architectures, attention mechanisms, and their compute/memory profiles
Experience with AWS Trainium/Inferentia or the Neuron Kernel Interface (NKI)
Experience with edge deployment, model compilation, or hardware-aware optimization

Responsibilities

Design and implement high-performance CUDA and Triton kernels for quantization-aware training, sparse matrix operations, and low-bit inference on modern GPU accelerators.
Analyze and optimize kernel-level performance for compression training workloads, conducting detailed performance analysis using profiling tools to identify and resolve bottlenecks.
Implement kernel-level optimizations such as operator fusion, tiling, memory access pattern optimization, and scheduling for compression-specific compute patterns.
Build a kernel development harness that enables any team member to profile kernel performance, test forward/backward accuracy, and validate at production scale.
Maintain and extend the team's training kernels library with clean interfaces, CI, and examples.
Collaborate closely with Applied Scientists, compiler engineers, and hardware architects to co-design ML-centric solutions.
Develop inference kernels for cloud deployment.
Build and maintain performance regression tests and benchmarking infrastructure.

Benefits

health insurance (medical, dental, vision, and prescription coverage; basic life and AD&D insurance with optional supplemental life plans; EAP; mental health support; medical advice line; flexible spending accounts; adoption and surrogacy reimbursement)
401(k) matching
paid time off
parental leave
sign-on payments
restricted stock units (RSUs)
