Performance
Matthew Nicely edited this page May 15, 2022
CUTLASS primitives are very efficient. When used to construct device-wide GEMM kernels, they exhibit performance comparable to cuBLAS for scalar GEMM computations. The accompanying figure shows CUTLASS performance relative to cuBLAS for large matrix dimensions on an NVIDIA A100, an NVIDIA A2, an NVIDIA TITAN V, and an NVIDIA GeForce RTX 2080 Ti, with all kernels compiled using the CUDA 11.5 Toolkit. Tensor Core operations are implemented using CUDA's mma instruction.
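This page does not state how the relative-performance numbers were computed. As a rough sketch, assuming the conventional count of 2·M·N·K floating-point operations per GEMM and wall-clock timings taken on the same problem size, a ratio of this kind could be derived as follows. The function names `gemm_flops`, `gemm_gflops`, and `relative_performance` are hypothetical helpers for illustration and are not part of CUTLASS or cuBLAS.

```cpp
#include <cassert>
#include <cstdint>

// Floating-point operations in one GEMM, C = alpha * A * B + beta * C,
// with A of size M x K and B of size K x N. This uses the conventional
// 2 * M * N * K count (one multiply and one add per inner-product term)
// and ignores the lower-order alpha/beta scaling. Assumption: the page
// does not say which operation count its figure is based on.
double gemm_flops(std::int64_t m, std::int64_t n, std::int64_t k) {
    return 2.0 * static_cast<double>(m) * static_cast<double>(n) *
           static_cast<double>(k);
}

// Achieved throughput in GFLOP/s for a measured runtime in seconds.
double gemm_gflops(std::int64_t m, std::int64_t n, std::int64_t k,
                   double seconds) {
    assert(seconds > 0.0);
    return gemm_flops(m, n, k) / seconds / 1.0e9;
}

// Throughput of a candidate kernel relative to a baseline on the same
// problem size. Because both runs perform the same number of operations,
// the ratio of throughputs reduces to baseline time over candidate time.
double relative_performance(double candidate_seconds,
                            double baseline_seconds) {
    assert(candidate_seconds > 0.0 && baseline_seconds > 0.0);
    return baseline_seconds / candidate_seconds;
}
```

With this convention, a value of 1.0 from `relative_performance` means the candidate kernel matches the baseline, and values below 1.0 mean it is slower. Timings would normally be averaged over many launches after a warm-up run, since a single measurement of a GPU kernel tends to be noisy.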