KERNEL ENGINEER

MakerMaker • San Francisco, CA

3d • Onsite

About The Position

We're building autonomous research agents for recursive self-improvement (multi-agent systems that propose, run, and analyze machine learning experiments). We're a small team based in San Francisco, on-site. You'll write and optimize the GPU kernels and supporting systems software that make our training and inference workloads fast. This is deep, low-level work (performance counters, memory bandwidth, warp-level scheduling) applied to the specific shapes and patterns our models actually use. We hire kernel engineers because the gap between "this works" and "this is fast on the hardware we have" is enormous, and that gap directly bounds what our researchers can try. You'll close that gap.

Requirements

4+ years writing performant GPU kernels (CUDA, ROCm, Triton, or production-grade equivalent)
Hardware-level fluency: memory hierarchy, occupancy, register pressure, tensor cores, warp scheduling
Profiling fluency (Nsight, ncu, or comparable tools) and the discipline to measure before changing
Track record of shipping kernel-level optimizations that moved a measurable metric in a real system
Strong systems expertise: you understand how kernels live inside larger frameworks and how integration choices affect end-to-end performance
Comfortable reading framework-level Python and C++ around your kernels

Nice To Haves

Open-source contributions to kernel libraries, compilers, or ML frameworks
Experience with multiple accelerator architectures (different GPU families, TPUs, custom ASICs), preferably including AMD GPUs
Familiarity with collective communication primitives (NCCL or equivalent)
Compiler or runtime background

Responsibilities

Write and optimize GPU kernels (CUDA, ROCm, Triton, or similar) for training and inference workloads: attention variants, MoE layers, custom activations, communication primitives
Profile real workloads with hardware counters and translate findings into specific kernel-level optimizations
Co-design kernels with the research teams: when the kernel and the algorithm need to change together, you participate in both
Integrate optimized kernels into our training and serving stacks; benchmark before and after; verify the win is real end-to-end
Maintain kernel quality over time as hardware, frameworks, and workloads shift underneath it
Spread kernel-level fluency across the team; we want this expertise shared, not siloed
