Feature request: support for matmul with integer matrices #64

Closed · DilumAluthge opened this issue Jan 24, 2021 · 4 comments · Fixed by #101

@DilumAluthge (Contributor)

As far as I can tell, cuBLAS does not support matrix multiplication with integer matrices. When I try, I get errors that look like this:

julia> CUDA.CUBLAS.gemmEx!('N', 'N', alpha, a, b, beta, c_cublas)
ERROR: ArgumentError: gemmEx does not support Int32=Int32*Int32

So I was thinking that integer matmul could be a good use case for GemmKernels, if support were added.

Unfortunately, it looks like GemmKernels doesn't currently support integer matrices either. For example:

julia> GemmKernels.BLAS.gemmEx!('N', 'N', alpha, a, b, beta, c_gemmkernels)
ERROR: MethodError: no method matching shared_layout_ab(::Type{CUDA.CuArray{Int32,2}}, ::Val{false})
Closest candidates are:
  shared_layout_ab(::Type{CUDA.CuArray{Float16,N}}, ::Any) where N at /users/daluthge/.julia/packages/GemmKernels/s61GT/src/blas.jl:13
  shared_layout_ab(::Type{LinearAlgebra.Diagonal{Float16,CUDA.CuArray{Float16,N}}}, ::Any) where {N, P} at /users/daluthge/.julia/packages/GemmKernels/s61GT/src/blas.jl:14
Stacktrace:
 [1] gemmEx!(::Char, ::Char, ::Int32, ::CUDA.CuArray{Int32,2}, ::CUDA.CuArray{Int32,2}, ::Int32, ::CUDA.CuArray{Int32,2}) at /users/daluthge/.julia/packages/GemmKernels/s61GT/src/blas.jl:46
 [2] top-level scope at REPL[34]:1

Would it be possible to add support for integer matrices to GemmKernels?


Here is a full MWE:

julia> import CUDA

julia> import GemmKernels

julia> import InteractiveUtils

julia> import Test

julia> import Pkg

julia> Pkg.status()
Status `/gpfs/home/daluthge/.julia/environments/v1.5/Project.toml`
  [052768ef] CUDA v2.4.1
  [312cec97] GemmKernels v0.1.0 `https://github.com/JuliaGPU/GemmKernels.jl#master`

julia> InteractiveUtils.versioninfo()
Julia Version 1.5.3
Commit 788b2c77c1 (2020-11-09 13:37 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) Gold 5122 CPU @ 3.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake-avx512)
Environment:
  JULIA_NUM_THREADS = auto
  JULIA_PKG_SERVER =
  JULIA_CUDA_VERBOSE = true

julia> CUDA.versioninfo()
CUDA toolkit 11.1.1, artifact installation
CUDA driver 11.2.0
NVIDIA driver 460.27.4

Libraries:
- CUBLAS: 11.3.0
- CURAND: 10.2.2
- CUFFT: 10.3.0
- CUSOLVER: 11.0.1
- CUSPARSE: 11.3.0
- CUPTI: 14.0.0
- NVML: 11.0.0+460.27.4
- CUDNN: 8.0.4 (for CUDA 11.1.0)
- CUTENSOR: 1.2.1 (for CUDA 11.1.0)

Toolchain:
- Julia: 1.5.3
- LLVM: 9.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4
- Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75

Environment:
- JULIA_CUDA_VERBOSE: true

1 device:
  0: TITAN V (sm_70, 11.346 GiB / 11.784 GiB available)

julia> CUDA.capability.(CUDA.devices())
1-element Array{VersionNumber,1}:
 v"7.0.0"

julia> M = N = K = 128
128

julia> T = Int16
Int16

julia> alpha = T(1)
1

julia> beta = T(0)
0

julia> a_h = rand(T, (M, K));

julia> b_h = rand(T, (K, N));

julia> c_h = rand(T, (M, N));

julia> a  = CUDA.CuArray(deepcopy(a_h));

julia> b  = CUDA.CuArray(deepcopy(b_h));

julia> c_cublas = CUDA.CuArray(deepcopy(c_h));

julia> c_gemmkernels = CUDA.CuArray(deepcopy(c_h));

julia> CUDA.CUBLAS.gemmEx!('N', 'N', alpha, a, b, beta, c_cublas)
ERROR: ArgumentError: gemmEx does not support Int16=Int16*Int16
Stacktrace:
 [1] gemmEx!(::Char, ::Char, ::Number, ::Union{CUDA.CuArray{T,1}, CUDA.CuArray{T,2}} where T, ::Union{CUDA.CuArray{T,1}, CUDA.CuArray{T,2}} where T, ::Number, ::Union{CUDA.CuArray{T,1}, CUDA.CuArray{T,2}} where T; algo::CUDA.CUBLAS.cublasGemmAlgo_t) at /users/daluthge/.julia/packages/CUDA/wTQsK/lib/cublas/wrappers.jl:829
 [2] gemmEx!(::Char, ::Char, ::Number, ::Union{CUDA.CuArray{T,1}, CUDA.CuArray{T,2}} where T, ::Union{CUDA.CuArray{T,1}, CUDA.CuArray{T,2}} where T, ::Number, ::Union{CUDA.CuArray{T,1}, CUDA.CuArray{T,2}} where T) at /users/daluthge/.julia/packages/CUDA/wTQsK/lib/cublas/wrappers.jl:819
 [3] top-level scope at REPL[21]:1

julia> GemmKernels.BLAS.gemmEx!('N', 'N', alpha, a, b, beta, c_gemmkernels)
ERROR: MethodError: no method matching shared_layout_ab(::Type{CUDA.CuArray{Int16,2}}, ::Val{false})
Closest candidates are:
  shared_layout_ab(::Type{CUDA.CuArray{Float16,N}}, ::Any) where N at /users/daluthge/.julia/packages/GemmKernels/s61GT/src/blas.jl:13
  shared_layout_ab(::Type{LinearAlgebra.Diagonal{Float16,CUDA.CuArray{Float16,N}}}, ::Any) where {N, P} at /users/daluthge/.julia/packages/GemmKernels/s61GT/src/blas.jl:14
Stacktrace:
 [1] gemmEx!(::Char, ::Char, ::Int16, ::CUDA.CuArray{Int16,2}, ::CUDA.CuArray{Int16,2}, ::Int16, ::CUDA.CuArray{Int16,2}) at /users/daluthge/.julia/packages/GemmKernels/s61GT/src/blas.jl:46
 [2] top-level scope at REPL[22]:1

julia> T = Int32
Int32

julia> alpha = T(1)
1

julia> beta = T(0)
0

julia> a_h = rand(T, (M, K));

julia> b_h = rand(T, (K, N));

julia> c_h = rand(T, (M, N));

julia> a  = CUDA.CuArray(deepcopy(a_h));

julia> b  = CUDA.CuArray(deepcopy(b_h));

julia> c_cublas = CUDA.CuArray(deepcopy(c_h));

julia> c_gemmkernels = CUDA.CuArray(deepcopy(c_h));

julia> CUDA.CUBLAS.gemmEx!('N', 'N', alpha, a, b, beta, c_cublas)
ERROR: ArgumentError: gemmEx does not support Int32=Int32*Int32
Stacktrace:
 [1] gemmEx!(::Char, ::Char, ::Number, ::Union{CUDA.CuArray{T,1}, CUDA.CuArray{T,2}} where T, ::Union{CUDA.CuArray{T,1}, CUDA.CuArray{T,2}} where T, ::Number, ::Union{CUDA.CuArray{T,1}, CUDA.CuArray{T,2}} where T; algo::CUDA.CUBLAS.cublasGemmAlgo_t) at /users/daluthge/.julia/packages/CUDA/wTQsK/lib/cublas/wrappers.jl:829
 [2] gemmEx!(::Char, ::Char, ::Number, ::Union{CUDA.CuArray{T,1}, CUDA.CuArray{T,2}} where T, ::Union{CUDA.CuArray{T,1}, CUDA.CuArray{T,2}} where T, ::Number, ::Union{CUDA.CuArray{T,1}, CUDA.CuArray{T,2}} where T) at /users/daluthge/.julia/packages/CUDA/wTQsK/lib/cublas/wrappers.jl:819
 [3] top-level scope at REPL[33]:1

julia> GemmKernels.BLAS.gemmEx!('N', 'N', alpha, a, b, beta, c_gemmkernels)
ERROR: MethodError: no method matching shared_layout_ab(::Type{CUDA.CuArray{Int32,2}}, ::Val{false})
Closest candidates are:
  shared_layout_ab(::Type{CUDA.CuArray{Float16,N}}, ::Any) where N at /users/daluthge/.julia/packages/GemmKernels/s61GT/src/blas.jl:13
  shared_layout_ab(::Type{LinearAlgebra.Diagonal{Float16,CUDA.CuArray{Float16,N}}}, ::Any) where {N, P} at /users/daluthge/.julia/packages/GemmKernels/s61GT/src/blas.jl:14
Stacktrace:
 [1] gemmEx!(::Char, ::Char, ::Int32, ::CUDA.CuArray{Int32,2}, ::CUDA.CuArray{Int32,2}, ::Int32, ::CUDA.CuArray{Int32,2}) at /users/daluthge/.julia/packages/GemmKernels/s61GT/src/blas.jl:46
 [2] top-level scope at REPL[34]:1

cc: @vchuravy @thomasfaingnaert @maleadt

@DilumAluthge (Contributor, Author)

Some Googling suggests that cuBLAS does in fact support integer matrices, but I can't get it to work via the CUDA.CUBLAS.gemmEx! function.

Even if cuBLAS does support them, I think it would be great to also have integer-matrix support in GemmKernels.jl.

@thomasfaingnaert (Member)

Turing-generation Tensor Cores do support integer GEMM, but only with 1-bit, 4-bit, or 8-bit integer inputs and 32-bit accumulation. So I'm guessing you would need to pass e.g. 8-bit A and B matrices and 32-bit C and D matrices to CUDA.CUBLAS.gemmEx!.
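
For reference, here is a hedged, untested sketch of that suggestion; whether this exact call is accepted will depend on the cuBLAS version and its dimension/alignment requirements for integer GEMM:

import CUDA

# Untested sketch: Int8 inputs with Int32 accumulation, the combination
# the integer Tensor Core path expects. alpha and beta are Int32 to match
# the accumulator type.
M = N = K = 128
a = CUDA.CuArray(rand(Int8, M, K))
b = CUDA.CuArray(rand(Int8, K, N))
c = CUDA.CuArray(zeros(Int32, M, N))

CUDA.CUBLAS.gemmEx!('N', 'N', Int32(1), a, b, Int32(0), c)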

With regard to integer support for GemmKernels, I personally don't have a lot of time to continue working on it at the moment, but if anyone wants to add support, I'd be happy to answer any questions.

One would first need to extend the WMMA support in CUDA.jl to include the Turing-generation Tensor Core operations, and then see which adaptations to GemmKernels.jl are necessary.
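
For context, the existing high-level WMMA API in CUDA.jl covers the FP16-input/FP32-accumulate case; the sketch below is adapted from the CUDA.jl documentation. Integer support would mean adding analogous Int8 fragment types and the matching intrinsics:

using CUDA

# One warp computes a 16x16x16 GEMM fragment using the FP16 Tensor Core
# path. An integer version would need a Config with Int8 inputs and
# Int32 accumulation.
function wmma_kernel(a, b, c, d)
    conf = WMMA.Config{16, 16, 16, Float32}
    a_frag = WMMA.load_a(pointer(a), 16, WMMA.ColMajor, conf)
    b_frag = WMMA.load_b(pointer(b), 16, WMMA.ColMajor, conf)
    c_frag = WMMA.load_c(pointer(c), 16, WMMA.ColMajor, conf)
    d_frag = WMMA.mma(a_frag, b_frag, c_frag, conf)
    WMMA.store_d(pointer(d), d_frag, 16, WMMA.ColMajor, conf)
    return
end

a = CUDA.CuArray(rand(Float16, 16, 16))
b = CUDA.CuArray(rand(Float16, 16, 16))
c = CUDA.CuArray(rand(Float32, 16, 16))
d = similar(c)
@cuda threads=32 wmma_kernel(a, b, c, d)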

@maleadt (Member) commented Jan 24, 2021

Or, alternatively, a non-WMMA code path. @thomasfaingnaert, we discussed that at some point, but could you detail what that would require? If we want to fold this into CUDA.jl or GPUArrays.jl, we'll need such an option anyway.

@thomasfaingnaert (Member)

> Or, alternatively, a non-WMMA code path. @thomasfaingnaert, we discussed that at some point, but could you detail what that would require? If we want to fold this into CUDA.jl or GPUArrays.jl, we'll need such an option anyway.

Off the top of my head, adding support for new datatypes requires the following:

  • The AlignedRowMajor and AlignedColMajor layouts should be reusable directly. One would need to generalise the memory vectorisation code, though. I had to use some ugly workarounds to get around the fact that Float16 was mapped to i16 instead of half, but that shouldn't be a problem in Julia 1.6, so that code can be cleaned up significantly. Would it be an idea to add support for explicit vectorisation directly in CUDA.jl, so that one may do, e.g.
# arr1, arr2 = 512 x 512 Float32 CuArray
@aligned arr1[1, 5:8] = arr2[1, 5:8]

which automatically uses vectorised loads and stores? (A sketch of how this can be emulated today follows this list.)

  • Add a (WMMA or non-WMMA) operator that can handle more than just the FP16/FP32 data types. Ideally, we want a generic operator that is used as a fallback when the datatype is not compatible with WMMA, or the user's GPU does not support WMMA (a minimal sketch of such a fallback follows this list).
  • Finally, some experimentation or a brute-force search to determine the optimal tiling parameters for this data type.
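
Regarding the generic fallback operator in the second bullet: a minimal, hedged sketch of what it could compute is a naive non-WMMA kernel in which each thread produces one element of C, working for any element type with * and +. The real operator would of course plug into GemmKernels' layout and tiling machinery:

using CUDA

# Naive GEMM fallback: no Tensor Cores, no tiling; each thread computes
# C[i, j] = alpha * sum(A[i, :] .* B[:, j]) + beta * C[i, j].
function generic_matmul!(C, A, B, alpha, beta)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    if i <= size(C, 1) && j <= size(C, 2)
        acc = zero(eltype(C))
        for k in 1:size(A, 2)
            @inbounds acc += A[i, k] * B[k, j]
        end
        @inbounds C[i, j] = alpha * acc + beta * C[i, j]
    end
    return
end

A = CUDA.CuArray(rand(Int32(-10):Int32(10), 128, 128))
B = CUDA.CuArray(rand(Int32(-10):Int32(10), 128, 128))
C = CUDA.zeros(Int32, 128, 128)
@cuda threads=(16, 16) blocks=(8, 8) generic_matmul!(C, A, B, Int32(1), Int32(0))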

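And regarding the vectorisation question in the first bullet, one way this can be emulated today is by widening the element type, so that each aligned access covers 16 bytes; a hedged sketch:

using CUDA

# Reinterpret four Float32s as a single 16-byte element; each load/store
# in the kernel then moves 128 bits at a time (assuming 16-byte-aligned
# base pointers, which fresh CuArray allocations provide).
function vcopy!(dst, src)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(dst)
        @inbounds dst[i] = src[i]
    end
    return
end

arr1 = CUDA.zeros(Float32, 512, 512)
arr2 = CUDA.rand(Float32, 512, 512)
wide1 = reinterpret(NTuple{4, Float32}, vec(arr1))
wide2 = reinterpret(NTuple{4, Float32}, vec(arr2))
@cuda threads=256 blocks=cld(length(wide1), 256) vcopy!(wide1, wide2)
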
@thomasfaingnaert linked a pull request on May 15, 2023 that will close this issue.