• Expand our in-house analytical GPU framework to make it amenable to workload profiling, and validate the updates with Gainsight
• Identify opportunities for eDRAM (including refresh-free operation) by data type (activations, KV cache, weights) across a set of relevant multi-GPU (single-server) inference workloads
• Identify and contrast the opportunities for LLM prefill and decode separately, considering long context lengths
• Benchmark the eDRAM options (1T1C BEOL, 2TGC hybrid, 2TGC BEOL) against the SRAM baseline on workload-level inference energy and latency at iso-area and iso-capacity
• Additional questions:
  • How can we modify the GPU architecture to get the most out of eDRAM?
  • Which AI workloads would benefit the most from eDRAM?
  • What architectural and algorithmic options maximize refresh-free operation?
  • How much benefit would eDRAM bring to AI training?
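The iso-area and iso-capacity comparison in the benchmarking item can be sketched analytically: fix either the silicon area or the capacity, then derive the other quantity plus access energy for each memory option. The sketch below is a minimal illustration of that methodology; all density and energy-per-bit numbers are placeholders (not measured values) and would come from the calibrated in-house analytical framework in practice.

```python
# Analytical sketch: compare on-chip memory options at iso-area and
# iso-capacity. All density/energy figures are PLACEHOLDER assumptions,
# to be replaced with calibrated data from the analytical GPU framework.
from dataclasses import dataclass

@dataclass
class MemOption:
    name: str
    density_mb_per_mm2: float   # placeholder macro density (MB/mm^2)
    energy_pj_per_bit: float    # placeholder read energy (pJ/bit)

options = [
    MemOption("SRAM baseline", 0.45, 0.20),
    MemOption("1T1C BEOL",     1.20, 0.10),
    MemOption("2TGC hybrid",   0.90, 0.12),
    MemOption("2TGC BEOL",     1.00, 0.11),
]

AREA_MM2 = 100.0           # fixed area budget for the iso-area study
CAPACITY_MB = 64.0         # fixed capacity for the iso-capacity study
ACCESSED_BITS = 8 * 2**30  # example workload traffic: 1 GiB of reads

for m in options:
    iso_area_capacity = m.density_mb_per_mm2 * AREA_MM2     # MB in fixed area
    iso_cap_area = CAPACITY_MB / m.density_mb_per_mm2       # mm^2 for fixed MB
    read_energy_j = m.energy_pj_per_bit * ACCESSED_BITS * 1e-12
    print(f"{m.name:14s} iso-area capacity: {iso_area_capacity:6.1f} MB | "
          f"iso-capacity area: {iso_cap_area:6.1f} mm^2 | "
          f"read energy: {read_energy_j:.4f} J")
```

Denser eDRAM options show up either as more capacity in the same area (fewer off-chip accesses) or as the same capacity in less area; a full study would also charge refresh energy to the non-refresh-free options.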