Kernel Engineer (Compute / Accelerator)

DensityAI • Mountain View, CA

13d • $260,000 - $320,000

About The Position

You will write, evaluate, and profile specialized compute kernels that run on a custom AI accelerator. This is the critical interface between high-level ML workloads and silicon — your code directly determines how effectively the hardware performs. You'll work closely with the architecture and compiler teams to define the kernel programming model, implement core tensor operations, and drive the performance profiling workflow that validates silicon design decisions.

Requirements

C/C++: production-grade systems code, not scripted glue. You'll write performance-critical kernels.
CUDA or equivalent accelerator programming: deep experience writing GPU kernels and understanding warp/wavefront execution, memory coalescing, and shared-memory optimization. The mental model transfers directly to the custom accelerator.
Computer architecture: you need to reason about pipelines, memory hierarchies, data movement costs, and how software maps to hardware.
Performance profiling and optimization: you live in profilers. Identifying bottlenecks, measuring throughput, and iterating until kernels meet targets is the core loop.
Tensor operations: practical understanding of GEMM, convolution, attention, reduction, and scatter/gather as they map to hardware.
Python: for scripting, DSL integration, and profiling automation.

Nice To Haves

RISC-V, x86, or ARM64 ISA experience
MLIR or LLVM compiler infrastructure
HPC or scientific computing background (large-scale parallel compute intuition)
FPGA or Verilog/SystemVerilog (ability to read RTL and reason about the hardware you're targeting)
Familiarity with CUTLASS, Triton, or similar kernel libraries and DSLs

Responsibilities

Write and optimize compute kernels for a custom AI accelerator — tensor operations, data movement patterns, memory hierarchy exploitation
Develop and maintain profiling infrastructure to measure kernel performance against architectural targets
Define and document shuffle patterns for ML kernel primitives across CPU-like control, tensor cores, and CUTLASS-style operations
Drive kernel DSL design decisions — thread spawn mechanisms, register passing conventions, and memory management strategies
Enable end-to-end kernel execution on the architectural simulator
Collaborate with the compiler team on the MLIR dialect — your kernels are the primary validation target
Create onboarding documentation and kernel writing guides for the broader team