cuBLAS 12.4 has recently been integrated into PyTorch.
This cuBLAS update aims to close the significant performance gap
in matrix computations relative to state-of-the-art solutions such as CUTLASS.
References:
- Colfax: GEMM kernels with cuBLAS/CuTe/CUTLASS on Hopper
- PyTorch: CUTLASS Ping-Pong GEMM kernel on Hopper
- PyTorch: FP8 GEMMs with Triton/CUTLASS on Hopper
Our benchmark compares two implementations (sketched below):
- PyTorch's Linear layer, which relies on cuBLAS 12.4
- A custom PyTorch extension using CUTLASS 3.x
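The sketch below illustrates the two code paths being compared; the `cutlass_gemm_ext` module and its `gemm` function are hypothetical placeholders for the custom CUTLASS extension, not the benchmark's actual API, and the shapes are illustrative.

```python
import torch

# cuBLAS path: nn.Linear dispatches its GEMM to cuBLAS through PyTorch.
linear = torch.nn.Linear(8192, 8192, bias=False, device="cuda", dtype=torch.float16)
x = torch.randn(4096, 8192, device="cuda", dtype=torch.float16)
y_cublas = linear(x)

# CUTLASS path: a custom C++/CUDA extension wrapping a CUTLASS 3.x GEMM.
# `cutlass_gemm_ext` is a hypothetical module name used only for illustration.
try:
    import cutlass_gemm_ext
    y_cutlass = cutlass_gemm_ext.gemm(x, linear.weight.t())  # x @ W^T, like nn.Linear
    torch.testing.assert_close(y_cutlass, y_cublas, rtol=1e-2, atol=1e-2)
except ImportError:
    pass  # extension not built; see the build instructions at the end
```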
We chose to run this comparison in a real-world context, integrated
into a high-level framework. The study focuses solely on matrix-computation
performance in isolation, without CUTLASS's advanced optimizations
for specific cases, which puts cuBLAS in the most favorable comparison conditions.
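As a sketch of this isolated-GEMM methodology, the snippet below times PyTorch's Linear layer (the cuBLAS path) with CUDA events after a warmup phase; the shapes, iteration counts, and the TFLOP/s formula are illustrative assumptions, not the benchmark's exact configuration.

```python
import torch

def time_gpu(fn, iters=100, warmup=10):
    """Average GPU time per call in milliseconds, measured with CUDA events."""
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

m, n, k = 4096, 8192, 8192
x = torch.randn(m, k, device="cuda", dtype=torch.float16)
linear = torch.nn.Linear(k, n, bias=False, device="cuda", dtype=torch.float16)

ms = time_gpu(lambda: linear(x))
tflops = 2 * m * n * k / (ms * 1e-3) / 1e12  # a GEMM costs 2*M*N*K FLOPs
print(f"nn.Linear (cuBLAS): {ms:.3f} ms/iter, {tflops:.1f} TFLOP/s")
```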
| Component | Version / Model |
|---|---|
| PyTorch | 2.5.1 |
| CUDA | 12.4 |
| cuBLAS | 12.4 |
| CUTLASS | 3.6 |
| CPU | AMD EPYC |
| GPU | H100 SXM5 |
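A quick way to confirm the software side of this environment from Python (the CUTLASS version comes from the extension's build tree, so it is not queried here); the expected values in the comments simply restate the table above.

```python
import torch

print("PyTorch:", torch.__version__)              # expected 2.5.1
print("CUDA   :", torch.version.cuda)             # expected 12.4
print("GPU    :", torch.cuda.get_device_name(0))  # expected NVIDIA H100
```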
This study reveals that:
- cuBLAS, despite its late update for the Hopper architecture, still lags behind CUTLASS in performance.
- CUTLASS demonstrates advantages for LLM workloads:
  - Superior performance
  - An open-source solution that allows full control of the code, modifications, and advanced optimizations, as demonstrated by Flash Attention V3
  - The ability to integrate vertical and horizontal fusions and to explicitly avoid unnecessary synchronizations in multi-GPU environments (see the sketch below)
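To make the fusion point concrete, the sketch below shows the unfused sequence of kernels (GEMM, bias add, GELU) that a CUTLASS epilogue can collapse into a single launch; the fused entry point is only named in a comment because its exact API belongs to the hypothetical custom extension.

```python
import torch

x = torch.randn(4096, 8192, device="cuda", dtype=torch.float16)
w = torch.randn(4096, 8192, device="cuda", dtype=torch.float16)  # weight in (N, K) layout
b = torch.randn(4096, device="cuda", dtype=torch.float16)

# Unfused path: three separate kernels, each round-tripping the full
# activation tensor through HBM between launches.
y = torch.matmul(x, w.t())
y = y + b
y = torch.nn.functional.gelu(y)

# With a CUTLASS epilogue, the bias add and GELU run inside the GEMM itself,
# e.g. a hypothetical cutlass_gemm_ext.gemm_bias_gelu(x, w, b),
# avoiding the intermediate HBM traffic entirely.
```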
To build and run the benchmark:

```bash
git clone --recursive https://github.com/MatrixAssembler/hopper-bench.git
cd hopper-bench
make
make test
make bench
```