Performance

CUTLASS primitives are efficient enough that device-wide GEMM kernels built from them perform comparably to cuBLAS on scalar GEMM computations. The above figure shows CUTLASS performance relative to cuBLAS for large matrix dimensions on an NVIDIA A100, an NVIDIA A2, an NVIDIA TITAN V, and an NVIDIA GeForce RTX 2080 Ti, with kernels compiled using the CUDA 11.5 Toolkit. Tensor Core operations are implemented using CUDA's mma instruction.
