cuBLAS 12.4 has recently been integrated into PyTorch.
This cuBLAS update aims to close the significant performance gap
in matrix computations relative to state-of-the-art solutions such as CUTLASS.
References:
- Colfax: GEMM kernels with cuBLAS/CuTe/CUTLASS on Hopper
- PyTorch: CUTLASS Ping-Pong GEMM kernel on Hopper
- PyTorch: FP8 GEMMs with Triton/CUTLASS on Hopper
Our benchmark compares two implementations (sketched below):
- PyTorch's Linear layer, which relies on cuBLAS 12.4
- A custom PyTorch extension using CUTLASS 3.x
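The sketch below illustrates the two code paths being compared; the `cutlass_gemm_ext` module and its `gemm` function are hypothetical placeholders for the custom CUTLASS extension, not the benchmark's actual API, and the shapes are illustrative.

```python
import torch

# cuBLAS path: nn.Linear dispatches its GEMM to cuBLAS through PyTorch.
linear = torch.nn.Linear(8192, 8192, bias=False, device="cuda", dtype=torch.float16)
x = torch.randn(4096, 8192, device="cuda", dtype=torch.float16)
y_cublas = linear(x)

# CUTLASS path: a custom C++/CUDA extension wrapping a CUTLASS 3.x GEMM.
# `cutlass_gemm_ext` is a hypothetical module name used only for illustration.
try:
    import cutlass_gemm_ext
    y_cutlass = cutlass_gemm_ext.gemm(x, linear.weight.t())  # x @ W^T, like nn.Linear
    torch.testing.assert_close(y_cutlass, y_cublas, rtol=1e-2, atol=1e-2)
except ImportError:
    pass  # extension not built; see the build instructions at the end
```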
We chose to run this comparison in a real-world context, integrated
into a high-level framework. The study focuses solely on matrix-computation
performance in isolation, without CUTLASS's advanced optimizations
for specific cases, which puts cuBLAS in the most favorable comparison conditions.
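As a sketch of this isolated-GEMM methodology, the snippet below times PyTorch's Linear layer (the cuBLAS path) with CUDA events after a warmup phase; the shapes, iteration counts, and the TFLOP/s formula are illustrative assumptions, not the benchmark's exact configuration.

```python
import torch

def time_gpu(fn, iters=100, warmup=10):
    """Average GPU time per call in milliseconds, measured with CUDA events."""
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

m, n, k = 4096, 8192, 8192
x = torch.randn(m, k, device="cuda", dtype=torch.float16)
linear = torch.nn.Linear(k, n, bias=False, device="cuda", dtype=torch.float16)

ms = time_gpu(lambda: linear(x))
tflops = 2 * m * n * k / (ms * 1e-3) / 1e12  # a GEMM costs 2*M*N*K FLOPs
print(f"nn.Linear (cuBLAS): {ms:.3f} ms/iter, {tflops:.1f} TFLOP/s")
```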
| Component | Version / Model |
|---|---|
| PyTorch | 2.5.1 |
| CUDA | 12.4 |
| cuBLAS | 12.4 |
| CUTLASS | 3.6 |
| CPU | AMD EPYC |
| GPU | H100 SXM5 |
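A quick way to confirm the software side of this environment from Python (the CUTLASS version comes from the extension's build tree, so it is not queried here); the expected values in the comments simply restate the table above.

```python
import torch

print("PyTorch:", torch.__version__)              # expected 2.5.1
print("CUDA   :", torch.version.cuda)             # expected 12.4
print("GPU    :", torch.cuda.get_device_name(0))  # expected NVIDIA H100
```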
This study reveals that:
- cuBLAS, despite its late update for the Hopper architecture, still lags behind CUTLASS in performance.
- CUTLASS demonstrates advantages for LLM workloads:
  - Superior performance
  - An open-source solution that allows full control of the code, modifications, and advanced optimizations, as demonstrated by Flash Attention V3
  - The ability to integrate vertical and horizontal fusions and to explicitly avoid unnecessary synchronizations in multi-GPU environments (see the sketch below)
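To make the fusion point concrete, the sketch below shows the unfused sequence of kernels (GEMM, bias add, GELU) that a CUTLASS epilogue can collapse into a single launch; the fused entry point is only named in a comment because its exact API belongs to the hypothetical custom extension.

```python
import torch

x = torch.randn(4096, 8192, device="cuda", dtype=torch.float16)
w = torch.randn(4096, 8192, device="cuda", dtype=torch.float16)  # weight in (N, K) layout
b = torch.randn(4096, device="cuda", dtype=torch.float16)

# Unfused path: three separate kernels, each round-tripping the full
# activation tensor through HBM between launches.
y = torch.matmul(x, w.t())
y = y + b
y = torch.nn.functional.gelu(y)

# With a CUTLASS epilogue, the bias add and GELU run inside the GEMM itself,
# e.g. a hypothetical cutlass_gemm_ext.gemm_bias_gelu(x, w, b),
# avoiding the intermediate HBM traffic entirely.
```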
To build and run the benchmark:

```bash
git clone --recursive https://github.com/MatrixAssembler/hopper-bench.git
cd hopper-bench
make
make test
make bench
```