This project investigates NVIDIA cuBLAS emulation techniques that achieve high-precision (FP32 and FP64) results using low-precision hardware such as Tensor Cores, which can deliver significant speedups on modern architectures. While these approaches offer substantial performance gains, they may introduce hidden costs in memory usage, dispatch latency, and energy consumption. The study characterizes how input data distributions influence throughput and quantifies the associated memory and execution overhead. Ultimately, it identifies when emulation is both numerically safe and performance-efficient versus when native precision paths are preferable.
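The core idea behind such emulation schemes is to split each high-precision operand into a sum of low-precision terms, compute the partial products on low-precision units, and accumulate them at higher precision. The posting does not specify cuBLAS's internal algorithm, so the following is only a minimal NumPy sketch of that split-precision idea: each FP32 matrix is decomposed into a high and a low FP16 part, and the four cross products are accumulated in FP32, mimicking tensor-core-style FP16-in/FP32-accumulate GEMM. The function names (`split_fp32`, `emulated_matmul`) are illustrative, not cuBLAS API.

```python
import numpy as np

def split_fp32(x):
    """Split an FP32 array into high and low FP16 parts so that x ~ hi + lo."""
    hi = x.astype(np.float16)                       # coarse FP16 approximation
    lo = (x - hi.astype(np.float32)).astype(np.float16)  # residual, also FP16
    return hi, lo

def emulated_matmul(a, b):
    """Emulate an FP32 GEMM from FP16 operands with FP32 accumulation.

    Each of the four partial products takes FP16-rounded inputs (as a
    tensor core would) but is accumulated in FP32, recovering most of the
    FP32 mantissa from the hi/lo decomposition.
    """
    a_hi, a_lo = split_fp32(a)
    b_hi, b_lo = split_fp32(b)
    f32 = np.float32
    return (a_hi.astype(f32) @ b_hi.astype(f32)
            + a_hi.astype(f32) @ b_lo.astype(f32)
            + a_lo.astype(f32) @ b_hi.astype(f32)
            + a_lo.astype(f32) @ b_lo.astype(f32))
```

Note the trade-off the project aims to quantify: the emulated path performs four low-precision multiplies plus extra splitting and accumulation work in place of one native multiply, so its memory traffic and energy cost grow even as arithmetic throughput improves.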
Job Type: Full-time
Career Level: Intern
Education Level: No Education Listed