Optimized Single-Precision General Matrix Multiplication (SGEMM) using CUDA, achieving 89% of cuBLAS performance.

Tingwei-Jen/SGEMM_Optimization

SGEMM_Optimization

All benchmarks were run on an NVIDIA GeForce RTX 3070 Ti.

Throughput in GFLOPS for 4096x4096 matrices:

| Kernel | GFLOPS | Performance relative to cuBLAS |
| --- | --- | --- |
| Naive | 1395 | 9.5% |
| SMEM Method | 1316 | 9.0% |
| 2D Tiling | 6839 | 46.9% |
| Solve Bank Conflicts (Padding) | 7776 | 53.3% |
| Register | 7522 | 51.5% |
| Float4 | 12601 | 86.3% |
| Float4 + Prefetch | 9006 | 61.7% |
| Tuning | 12942 | 88.7% |
| cuBLAS | 14597 | 100.0% |
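
For readers who want to reproduce figures of this kind, the sketch below shows one common way to turn a measured kernel time into GFLOPS and into a percentage of a reference. It assumes the usual convention of counting 2·M·N·K floating-point operations per SGEMM; the repository does not state which convention or timing method it uses, and the function names `sgemm_gflops` and `percent_of_reference` are hypothetical helpers, not part of this project.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: C = A * B with A (m x k) and B (k x n) is counted as
// 2*m*n*k floating-point operations (one multiply and one add per term).
// `seconds` is the measured wall-clock time of a single kernel launch.
double sgemm_gflops(std::int64_t m, std::int64_t n, std::int64_t k, double seconds) {
    assert(seconds > 0.0);
    const double flops = 2.0 * static_cast<double>(m)
                             * static_cast<double>(n)
                             * static_cast<double>(k);
    return flops / seconds / 1e9;  // FLOP/s -> GFLOPS
}

// Throughput of a kernel as a percentage of a reference throughput
// (for example, the cuBLAS row of the table).
double percent_of_reference(double kernel_gflops, double reference_gflops) {
    assert(reference_gflops > 0.0);
    return 100.0 * kernel_gflops / reference_gflops;
}
```

As a consistency check against the table, `percent_of_reference(12942.0, 14597.0)` evaluates to roughly 88.7, which matches the Tuning row. Under the same FLOP-counting assumption, a 4096x4096 multiplication that takes 10 ms corresponds to about 13744 GFLOPS.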
