Performance Comparison: cuBLAS 12.4 vs CUTLASS 3.X on NVIDIA Hopper

Introduction

cuBLAS 12.4 has recently been integrated into PyTorch.
This cuBLAS release aims to narrow the significant performance gap
in matrix computations relative to state-of-the-art solutions such as CUTLASS.

Objective

Our benchmark compares two implementations:

  • PyTorch's Linear layer using cuBLAS 12.4
  • A custom PyTorch extension using CUTLASS 3.X

We ran this comparison in a realistic setting, with both implementations
called from a high-level framework (PyTorch). The study measures only raw
matrix multiplication performance and deliberately leaves out CUTLASS's
advanced, case-specific optimizations, which gives cuBLAS the most favorable
conditions for the comparison.
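Measuring a single operation in isolation is usually done with a warmup-then-measure loop. The sketch below is a minimal, framework-agnostic version of that idea; `time_call` and its parameters are hypothetical names chosen for illustration and are not taken from this repository's benchmark code.

```python
import statistics
import time
from typing import Callable, Optional


def time_call(fn: Callable[[], object],
              warmup: int = 3,
              iters: int = 10,
              sync: Optional[Callable[[], None]] = None) -> float:
    """Return the median wall-clock time of fn() in milliseconds.

    `sync` is an optional hook that blocks until asynchronous work
    launched by fn() has finished (needed for GPU kernels).
    """
    # Warmup runs absorb one-time costs (lazy initialization, caches).
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()

    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for completion before stopping the clock
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


if __name__ == "__main__":
    # CPU stand-in workload, used only to show the call shape.
    ms = time_call(lambda: sum(i * i for i in range(10_000)))
    print(f"median: {ms:.3f} ms")
```

With PyTorch on a GPU, one would presumably pass the layer call as `fn` and `torch.cuda.synchronize` as `sync`, since CUDA kernels are launched asynchronously and the host clock would otherwise stop too early.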

Configuration

Component   Version
PyTorch     2.5.1
CUDA        12.4
cuBLAS      12.4
CUTLASS     3.6
CPU         AMD EPYC
GPU         H100 SXM5

Results

[Figure: Benchmark Results]
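The text here does not state which metric the results use. A common way to compare GEMM timings across problem sizes is to convert them to throughput, counting 2·M·N·K floating-point operations for an (M×K) by (K×N) product. The helper below is a sketch of that conversion; `gemm_tflops` is a hypothetical name, not a function from this repository.

```python
def gemm_tflops(m: int, n: int, k: int, elapsed_ms: float) -> float:
    """Throughput of an (m x k) @ (k x n) matrix product in TFLOP/s."""
    if elapsed_ms <= 0:
        raise ValueError("elapsed_ms must be positive")
    # One multiply and one add per inner-product term.
    flops = 2.0 * m * n * k
    return flops / (elapsed_ms * 1e-3) / 1e12


if __name__ == "__main__":
    # Illustrative input only, not a measured result.
    print(f"{gemm_tflops(4096, 4096, 4096, 1.0):.2f} TFLOP/s")
```

This count ignores any bias addition or activation, so it describes the matrix product alone.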

Conclusion

This study shows that:

  1. Even after its late update for the Hopper architecture,
    cuBLAS still trails CUTLASS in performance.
  2. CUTLASS offers advantages for LLM workloads:
    • Higher performance
    • An open-source codebase that gives full control over the code and allows
      modifications and advanced optimizations, as demonstrated by FlashAttention-3
    • The ability to apply vertical and horizontal fusion and to explicitly
      avoid unnecessary synchronization in multi-GPU environments
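As a conceptual illustration of vertical fusion, the sketch below compares a linear layer computed in three separate passes (matrix product, bias add, ReLU) with a version that applies the bias and activation while each output element is still in the accumulator. It is plain Python written only to show the idea and assumes "vertical fusion" means this kind of producer-consumer fusion; in CUTLASS, such fusion is typically expressed in the kernel's epilogue. The function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def linear_unfused(x: Matrix, w: Matrix, b: List[float]) -> Matrix:
    """Three passes over the data: GEMM, then bias add, then ReLU."""
    k, n = len(w), len(w[0])
    # Pass 1: matrix product, fully materialized.
    y = [[sum(row[p] * w[p][j] for p in range(k)) for j in range(n)]
         for row in x]
    # Pass 2: bias add re-reads and rewrites every element.
    y = [[v + b[j] for j, v in enumerate(row)] for row in y]
    # Pass 3: activation re-reads and rewrites every element again.
    return [[max(v, 0.0) for v in row] for row in y]


def linear_fused(x: Matrix, w: Matrix, b: List[float]) -> Matrix:
    """One pass: bias and ReLU are applied before the result is stored."""
    k, n = len(w), len(w[0])
    out = []
    for row in x:
        out_row = []
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += row[p] * w[p][j]
            # Fused epilogue: no intermediate matrix is written.
            out_row.append(max(acc + b[j], 0.0))
        out.append(out_row)
    return out
```

Both functions return the same values; the fused version simply avoids writing and re-reading the intermediate results, which on a GPU corresponds to saving memory traffic and kernel launches.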

Installation

git clone --recursive https://github.com/MatrixAssembler/hopper-bench.git
cd hopper-bench
make

Test

make test

Benchmark

make bench