Senior Compiler Engineer, GPU Code Object Rewriting & Tooling

Advanced Micro Devices, Inc. • San Jose, CA

Hybrid

About The Position

We are building next-generation infrastructure to predict, explain, and improve AMD GPU kernel performance across current and future architectures, including architectures whose final hardware is not yet available. This role is for a deeply technical performance-modeling leader who understands GPU hardware and software end to end: ISA, compiler code generation, memory hierarchy, schedulers, matrix units, occupancy, profiling, simulation, and kernel behavior. You will help define how we model new GPU architectures, validate those models against hardware and simulators, and teach compiler, runtime, architecture, and performance-library teams how to reason about GPU performance.

The work sits at the intersection of GPU performance modeling, compiler analysis, microarchitecture, profiler/simulator validation, and architecture-aware optimization. Hardware-software co-design experience is highly valuable, especially for influencing future ISA features, compiler-visible architecture choices, memory-system behavior, and matrix/tensor pipelines.

Requirements

Deep understanding of GPU microarchitecture and execution models, including waves/warps, SIMD/SIMT execution, registers, shared/local memory, caches, memory coalescing, barriers, occupancy, latency hiding, and scheduling.
Strong quantitative performance intuition: instruction throughput, dependency chains, memory bandwidth, cache effects, occupancy cliffs, issue bottlenecks, and resource contention.
Experience building analytical, trace-driven, simulation-based, or compiler-assisted performance models.
Strong C++ systems programming skills and experience building production-quality low-level tools.
Experience with compilers, compiler IRs, machine-level code generation, static/dynamic analysis, or target-specific optimization.
Ability to read and reason about low-level assembly, ISA encodings, compiler output, and profiler traces.
Experience analyzing performance using profilers, hardware counters, traces, simulators, microbenchmarks, or custom instrumentation.
Technical leadership: ability to set direction, influence across teams, mentor others, and explain complex hardware/software behavior clearly.
Strong validation mindset: you care about evidence, counterexamples, and error bars, and you avoid misleading point estimates.
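
As a concrete instance of the "occupancy cliffs" named in the requirements, the following minimal Python sketch shows how rounding each wave's VGPR allocation up to a fixed granule makes resident waves drop in discrete steps. The constants and the function names (round_up, waves_per_simd) are illustrative assumptions, not values or APIs from any specific AMD architecture or tool, and the model ignores the SGPR, LDS, and workgroup-size limits that a real occupancy calculation would also take the minimum over.

```python
# Illustrative architecture parameters: hypothetical values, not taken from
# any specific AMD ISA or hardware document.
VGPRS_PER_LANE_PER_SIMD = 512  # assumed VGPR budget per lane shared by waves on one SIMD
VGPR_ALLOC_GRANULE = 8         # assumed allocation granularity, in VGPRs
MAX_WAVES_PER_SIMD = 8         # assumed hardware cap on resident waves per SIMD


def round_up(value: int, granule: int) -> int:
    """Round value up to the next multiple of granule."""
    return -(-value // granule) * granule


def waves_per_simd(vgprs_used: int) -> int:
    """Resident waves per SIMD when VGPR usage is the only limiting resource."""
    allocated = round_up(max(vgprs_used, 1), VGPR_ALLOC_GRANULE)
    return min(MAX_WAVES_PER_SIMD, VGPRS_PER_LANE_PER_SIMD // allocated)


if __name__ == "__main__":
    for vgprs in (64, 65, 128, 129, 256):
        print(f"{vgprs:4d} VGPRs -> {waves_per_simd(vgprs)} waves/SIMD")
```

With these assumed constants, going from 64 to 65 VGPRs costs one wave (8 to 7), and going from 128 to 129 drops occupancy from 4 to 3 waves, a 25% loss for a single extra register.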

Nice To Haves

AMDGPU, GCN, RDNA, CDNA, ROCm, HIP, HSA, or the AMDGPU LLVM backend.
GPU performance tuning for HPC, AI, graphics, or performance libraries.
GPU architecture modeling, cycle simulators, trace-driven simulators, analytical performance models, or silicon bring-up.
Hardware-software co-design for new ISA features, memory systems, matrix/tensor units, schedulers, or compiler-visible architecture features.
Matrix/tensor instructions such as MFMA, WMMA, tensor cores, or other specialized math pipelines.
ROCm profiling tools, hardware performance counters, thread traces, ROCprof, ROCm Compute Profiler, Nsight Compute, or similar tooling.
Modeling LDS/shared-memory bank conflicts, cache behavior, memory coalescing, atomics, synchronization, barriers, and tail effects.
LLVM, MLIR, GCC, or other production compiler infrastructure.
Binary analysis, disassembly, LLVM MC, ELF/code-object metadata, DWARF/source correlation, or post-link analysis.
Machine learning for performance modeling, trace analysis, anomaly detection, autotuning, or learned residual correction.
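
LDS/shared-memory bank conflicts, listed among the nice-to-haves, can be estimated from per-lane addresses alone. This is a minimal sketch assuming 32 banks of 4 bytes each and a single 32-lane request; it treats lanes that read the same word as one broadcast and ignores how hardware splits wider waves or wider data types across cycles. The helper names conflict_degree and strided_addrs are hypothetical.

```python
from collections import defaultdict

NUM_BANKS = 32        # assumed number of LDS banks
BANK_WIDTH_BYTES = 4  # assumed width of each bank, in bytes


def conflict_degree(byte_addrs):
    """Worst-case number of distinct words that map to a single bank.

    Lanes reading the same word are assumed to be served by one broadcast,
    so they do not add to the conflict degree.
    """
    words_per_bank = defaultdict(set)
    for addr in byte_addrs:
        word = addr // BANK_WIDTH_BYTES
        words_per_bank[word % NUM_BANKS].add(word)
    return max((len(words) for words in words_per_bank.values()), default=0)


def strided_addrs(lanes, stride_words):
    """Byte addresses for lane i reading the 4-byte element at i * stride_words."""
    return [lane * stride_words * BANK_WIDTH_BYTES for lane in range(lanes)]


for stride in (1, 2, 32, 33):
    print(f"stride {stride:2d}: {conflict_degree(strided_addrs(32, stride))}-way")
```

Under these assumptions, a stride of 2 words gives a 2-way conflict, a stride of 32 serializes all lanes onto one bank, and padding the stride to 33 removes the conflict entirely, which is the usual reason for padding shared-memory tiles.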

Responsibilities

Lead the design of an explainable GPU performance-estimation system for AMDGPU kernels.
Build models from ISA, compiler output, code-object metadata, profiler data, simulator traces, microbenchmarks, and architecture facts.
Model key performance drivers: wave scheduling, occupancy, VGPR/SGPR pressure, matrix pipelines, VALU/SALU issue pressure, VMEM/SMEM traffic, LDS bank conflicts, cache behavior, memory coalescing, waitcnt dependencies, barriers, and latency hiding.
Use current-generation GPUs, hardware counters, synthetic kernels, and simulator traces to calibrate and validate predictions for future architectures.
Produce reports with estimated cycles, lower and upper bounds, uncertainty, bottleneck attribution, missing architecture facts, and source/PC-level explanations.
Partner with GPU architecture, compiler, runtime, profiler, simulator, and performance-library teams to turn modeling results into better hardware, better compilers, and faster kernels.
Mentor engineers, write design docs, lead technical reviews, and educate teams on GPU architecture and performance behavior.
Explore ML-assisted modeling where it improves calibration, residual prediction, anomaly detection, microbenchmark selection, autotuning, or trace analysis while keeping the core model explainable and hardware-grounded.
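
Reporting estimated cycles with bounds and bottleneck attribution can be illustrated with a simple resource-bound model. This is a hypothetical sketch, not the system described in this posting: ResourceDemand and estimate_cycles are made-up names, each resource's time is its work divided by an assumed throughput, the optimistic bound assumes perfect overlap across resources, and the serialized bound assumes none. A production model would add latency, occupancy, dependency, and waitcnt effects, so the serialized figure here is not a guaranteed upper bound.

```python
from dataclasses import dataclass


@dataclass
class ResourceDemand:
    name: str         # illustrative resource label, e.g. "VALU" or "VMEM"
    work: float       # units of work the kernel issues to this resource
    per_cycle: float  # assumed sustained throughput, in units per cycle


def estimate_cycles(demands):
    """Return (optimistic, serialized, bottleneck) for a list of demands.

    optimistic: all resources overlap perfectly, so the busiest one sets the time.
    serialized: no overlap between resources, so per-resource times add up.
    """
    times = {d.name: d.work / d.per_cycle for d in demands}
    bottleneck = max(times, key=times.get)
    return times[bottleneck], sum(times.values()), bottleneck


# Made-up demand counts and throughputs for a single hypothetical kernel.
demands = [
    ResourceDemand("VALU", 40_000, 4.0),
    ResourceDemand("VMEM", 12_000, 1.0),
    ResourceDemand("LDS", 6_000, 2.0),
]
low, high, limiter = estimate_cycles(demands)
print(f"{low:.0f} to {high:.0f} cycles, bottleneck: {limiter}")
```

With these made-up inputs the estimate spans 12,000 to 25,000 cycles and attributes the bottleneck to VMEM; a report would then attach the per-resource times as the explanation for that attribution.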