Design and implement high-performance compute kernels for AI primitives such as GEMM, attention, normalization, and convolution.
Optimize for throughput, latency, and memory-hierarchy efficiency across heterogeneous compute units (SIMD, matrix engines, DMA).
Collaborate with compiler and runtime teams to integrate kernels into Triton, PyTorch, or SYCL pipelines.
Profile and tune kernels using tools like Perfetto, VTune, Tracy, or custom simulators.
Prototype and evaluate precision formats (FP16, BF16, FP8 variants such as e5m2, etc.) and stochastic rounding.
Contribute to micro-architecture feedback loops, helping co-design ISA and memory features with the hardware team.
Write clear, well-structured, and reusable code (C++, CUDA, Triton, LLVM/MLIR).
Job Type
Full-time
Career Level
Mid Level