Feature request: support for matmul with integer matrices #64

Closed · DilumAluthge opened this issue Jan 24, 2021 · 4 comments · Fixed by #101

@DilumAluthge (Contributor)

As far as I can tell, cuBLAS does not support matrix multiplication with integer matrices. When I try, I get errors that look like this:

julia> CUDA.CUBLAS.gemmEx!('N', 'N', alpha, a, b, beta, c_cublas)
ERROR: ArgumentError: gemmEx does not support Int32=Int32*Int32

So I was thinking that integer matmul could be a good use case for GemmKernels, if support were added.

Unfortunately, it looks like GemmKernels doesn't currently support integer matrices either. For example:

julia> GemmKernels.BLAS.gemmEx!('N', 'N', alpha, a, b, beta, c_gemmkernels)
ERROR: MethodError: no method matching shared_layout_ab(::Type{CUDA.CuArray{Int32,2}}, ::Val{false})
Closest candidates are:
  shared_layout_ab(::Type{CUDA.CuArray{Float16,N}}, ::Any) where N at /users/daluthge/.julia/packages/GemmKernels/s61GT/src/blas.jl:13
  shared_layout_ab(::Type{LinearAlgebra.Diagonal{Float16,CUDA.CuArray{Float16,N}}}, ::Any) where {N, P} at /users/daluthge/.julia/packages/GemmKernels/s61GT/src/blas.jl:14
Stacktrace:
 [1] gemmEx!(::Char, ::Char, ::Int32, ::CUDA.CuArray{Int32,2}, ::CUDA.CuArray{Int32,2}, ::Int32, ::CUDA.CuArray{Int32,2}) at /users/daluthge/.julia/packages/GemmKernels/s61GT/src/blas.jl:46
 [2] top-level scope at REPL[34]:1

Would it be possible to add support for integer matrices to GemmKernels?


Here is a full MWE:

julia> import CUDA

julia> import GemmKernels

julia> import InteractiveUtils

julia> import Test

julia> import Pkg

julia> Pkg.status()
Status `/gpfs/home/daluthge/.julia/environments/v1.5/Project.toml`
  [052768ef] CUDA v2.4.1
  [312cec97] GemmKernels v0.1.0 `https://github.com/JuliaGPU/GemmKernels.jl#master`

julia> InteractiveUtils.versioninfo()
Julia Version 1.5.3
Commit 788b2c77c1 (2020-11-09 13:37 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) Gold 5122 CPU @ 3.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake-avx512)
Environment:
  JULIA_NUM_THREADS = auto
  JULIA_PKG_SERVER =
  JULIA_CUDA_VERBOSE = true

julia> CUDA.versioninfo()
CUDA toolkit 11.1.1, artifact installation
CUDA driver 11.2.0
NVIDIA driver 460.27.4

Libraries:
- CUBLAS: 11.3.0
- CURAND: 10.2.2
- CUFFT: 10.3.0
- CUSOLVER: 11.0.1
- CUSPARSE: 11.3.0
- CUPTI: 14.0.0
- NVML: 11.0.0+460.27.4
- CUDNN: 8.0.4 (for CUDA 11.1.0)
- CUTENSOR: 1.2.1 (for CUDA 11.1.0)

Toolchain:
- Julia: 1.5.3
- LLVM: 9.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4
- Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75

Environment:
- JULIA_CUDA_VERBOSE: true

1 device:
  0: TITAN V (sm_70, 11.346 GiB / 11.784 GiB available)

julia> CUDA.capability.(CUDA.devices())
1-element Array{VersionNumber,1}:
 v"7.0.0"

julia> M = N = K = 128
128

julia> T = Int16
Int16

julia> alpha = T(1)
1

julia> beta = T(0)
0

julia> a_h = rand(T, (M, K));

julia> b_h = rand(T, (K, N));

julia> c_h = rand(T, (M, N));

julia> a  = CUDA.CuArray(deepcopy(a_h));

julia> b  = CUDA.CuArray(deepcopy(b_h));

julia> c_cublas = CUDA.CuArray(deepcopy(c_h));

julia> c_gemmkernels = CUDA.CuArray(deepcopy(c_h));

julia> CUDA.CUBLAS.gemmEx!('N', 'N', alpha, a, b, beta, c_cublas)
ERROR: ArgumentError: gemmEx does not support Int16=Int16*Int16
Stacktrace:
 [1] gemmEx!(::Char, ::Char, ::Number, ::Union{CUDA.CuArray{T,1}, CUDA.CuArray{T,2}} where T, ::Union{CUDA.CuArray{T,1}, CUDA.CuArray{T,2}} where T, ::Number, ::Union{CUDA.CuArray{T,1}, CUDA.CuArray{T,2}} where T; algo::CUDA.CUBLAS.cublasGemmAlgo_t) at /users/daluthge/.julia/packages/CUDA/wTQsK/lib/cublas/wrappers.jl:829
 [2] gemmEx!(::Char, ::Char, ::Number, ::Union{CUDA.CuArray{T,1}, CUDA.CuArray{T,2}} where T, ::Union{CUDA.CuArray{T,1}, CUDA.CuArray{T,2}} where T, ::Number, ::Union{CUDA.CuArray{T,1}, CUDA.CuArray{T,2}} where T) at /users/daluthge/.julia/packages/CUDA/wTQsK/lib/cublas/wrappers.jl:819
 [3] top-level scope at REPL[21]:1

julia> GemmKernels.BLAS.gemmEx!('N', 'N', alpha, a, b, beta, c_gemmkernels)
ERROR: MethodError: no method matching shared_layout_ab(::Type{CUDA.CuArray{Int16,2}}, ::Val{false})
Closest candidates are:
  shared_layout_ab(::Type{CUDA.CuArray{Float16,N}}, ::Any) where N at /users/daluthge/.julia/packages/GemmKernels/s61GT/src/blas.jl:13
  shared_layout_ab(::Type{LinearAlgebra.Diagonal{Float16,CUDA.CuArray{Float16,N}}}, ::Any) where {N, P} at /users/daluthge/.julia/packages/GemmKernels/s61GT/src/blas.jl:14
Stacktrace:
 [1] gemmEx!(::Char, ::Char, ::Int16, ::CUDA.CuArray{Int16,2}, ::CUDA.CuArray{Int16,2}, ::Int16, ::CUDA.CuArray{Int16,2}) at /users/daluthge/.julia/packages/GemmKernels/s61GT/src/blas.jl:46
 [2] top-level scope at REPL[22]:1

julia> T = Int32
Int32

julia> alpha = T(1)
1

julia> beta = T(0)
0

julia> a_h = rand(T, (M, K));

julia> b_h = rand(T, (K, N));

julia> c_h = rand(T, (M, N));

julia> a  = CUDA.CuArray(deepcopy(a_h));

julia> b  = CUDA.CuArray(deepcopy(b_h));

julia> c_cublas = CUDA.CuArray(deepcopy(c_h));

julia> c_gemmkernels = CUDA.CuArray(deepcopy(c_h));

julia> CUDA.CUBLAS.gemmEx!('N', 'N', alpha, a, b, beta, c_cublas)
ERROR: ArgumentError: gemmEx does not support Int32=Int32*Int32
Stacktrace:
 [1] gemmEx!(::Char, ::Char, ::Number, ::Union{CUDA.CuArray{T,1}, CUDA.CuArray{T,2}} where T, ::Union{CUDA.CuArray{T,1}, CUDA.CuArray{T,2}} where T, ::Number, ::Union{CUDA.CuArray{T,1}, CUDA.CuArray{T,2}} where T; algo::CUDA.CUBLAS.cublasGemmAlgo_t) at /users/daluthge/.julia/packages/CUDA/wTQsK/lib/cublas/wrappers.jl:829
 [2] gemmEx!(::Char, ::Char, ::Number, ::Union{CUDA.CuArray{T,1}, CUDA.CuArray{T,2}} where T, ::Union{CUDA.CuArray{T,1}, CUDA.CuArray{T,2}} where T, ::Number, ::Union{CUDA.CuArray{T,1}, CUDA.CuArray{T,2}} where T) at /users/daluthge/.julia/packages/CUDA/wTQsK/lib/cublas/wrappers.jl:819
 [3] top-level scope at REPL[33]:1

julia> GemmKernels.BLAS.gemmEx!('N', 'N', alpha, a, b, beta, c_gemmkernels)
ERROR: MethodError: no method matching shared_layout_ab(::Type{CUDA.CuArray{Int32,2}}, ::Val{false})
Closest candidates are:
  shared_layout_ab(::Type{CUDA.CuArray{Float16,N}}, ::Any) where N at /users/daluthge/.julia/packages/GemmKernels/s61GT/src/blas.jl:13
  shared_layout_ab(::Type{LinearAlgebra.Diagonal{Float16,CUDA.CuArray{Float16,N}}}, ::Any) where {N, P} at /users/daluthge/.julia/packages/GemmKernels/s61GT/src/blas.jl:14
Stacktrace:
 [1] gemmEx!(::Char, ::Char, ::Int32, ::CUDA.CuArray{Int32,2}, ::CUDA.CuArray{Int32,2}, ::Int32, ::CUDA.CuArray{Int32,2}) at /users/daluthge/.julia/packages/GemmKernels/s61GT/src/blas.jl:46
 [2] top-level scope at REPL[34]:1

cc: @vchuravy @thomasfaingnaert @maleadt

@DilumAluthge (Contributor, Author)

Some Googling suggests that cuBLAS does in fact support integer matrices, but I can't get it to work via the CUDA.CUBLAS.gemmEx! function.

Even if cuBLAS does support them, I think it would be great to also have integer-matrix support in GemmKernels.jl.

@thomasfaingnaert (Member)

Turing-generation Tensor Cores do support integer GEMM, but only with 1-bit, 4-bit, or 8-bit integer inputs and 32-bit accumulation. So I'm guessing you would need to pass e.g. 8-bit A and B matrices and 32-bit C and D matrices to CUDA.CUBLAS.gemmEx!.
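
For reference, here is a hedged, untested sketch of that suggestion; whether this exact call is accepted will depend on the cuBLAS version and its dimension/alignment requirements for integer GEMM:

import CUDA

# Untested sketch: Int8 inputs with Int32 accumulation, the combination
# the integer Tensor Core path expects. alpha and beta are Int32 to match
# the accumulator type.
M = N = K = 128
a = CUDA.CuArray(rand(Int8, M, K))
b = CUDA.CuArray(rand(Int8, K, N))
c = CUDA.CuArray(zeros(Int32, M, N))

CUDA.CUBLAS.gemmEx!('N', 'N', Int32(1), a, b, Int32(0), c)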

With regard to integer support for GemmKernels, I personally don't have a lot of time to continue working on it at the moment, but if anyone wants to add support, I'd be happy to answer any questions.

One would first need to extend the WMMA support in CUDA.jl to include the Turing-generation Tensor Core operations, and then see which adaptations to GemmKernels.jl are necessary.
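
For context, the existing high-level WMMA API in CUDA.jl covers the FP16-input/FP32-accumulate case; the sketch below is adapted from the CUDA.jl documentation. Integer support would mean adding analogous Int8 fragment types and the matching intrinsics:

using CUDA

# One warp computes a 16x16x16 GEMM fragment using the FP16 Tensor Core
# path. An integer version would need a Config with Int8 inputs and
# Int32 accumulation.
function wmma_kernel(a, b, c, d)
    conf = WMMA.Config{16, 16, 16, Float32}
    a_frag = WMMA.load_a(pointer(a), 16, WMMA.ColMajor, conf)
    b_frag = WMMA.load_b(pointer(b), 16, WMMA.ColMajor, conf)
    c_frag = WMMA.load_c(pointer(c), 16, WMMA.ColMajor, conf)
    d_frag = WMMA.mma(a_frag, b_frag, c_frag, conf)
    WMMA.store_d(pointer(d), d_frag, 16, WMMA.ColMajor, conf)
    return
end

a = CUDA.CuArray(rand(Float16, 16, 16))
b = CUDA.CuArray(rand(Float16, 16, 16))
c = CUDA.CuArray(rand(Float32, 16, 16))
d = similar(c)
@cuda threads=32 wmma_kernel(a, b, c, d)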

@maleadt (Member) commented Jan 24, 2021

Or, alternatively, a non-WMMA code path. @thomasfaingnaert, we discussed that at some point, but could you detail what that would require? If we want to fold this into CUDA.jl or GPUArrays.jl, we'll need such an option anyway.

@thomasfaingnaert (Member)

> Or, alternatively, a non-WMMA code path. @thomasfaingnaert, we discussed that at some point, but could you detail what that would require? If we want to fold this into CUDA.jl or GPUArrays.jl, we'll need such an option anyway.

Off the top of my head, adding support for new datatypes requires the following:

  • The AlignedRowMajor and AlignedColMajor layouts should be reusable directly. One would need to generalise the memory vectorisation code, though. I had to use some ugly workarounds to get around the fact that Float16 was mapped to i16 instead of half, but that shouldn't be a problem in Julia 1.6, so that code can be cleaned up significantly. Would it be an idea to add support for explicit vectorisation directly in CUDA.jl, so that one may do, e.g.
# arr1, arr2 = 512 x 512 Float32 CuArray
@aligned arr1[1, 5:8] = arr2[1, 5:8]

which automatically uses vectorised loads and stores? (A sketch of how this can be emulated today follows this list.)

  • Add a (WMMA or non-WMMA) operator that can handle more than just the FP16/FP32 data types. Ideally, we want a generic operator that is used as a fallback when the datatype is not compatible with WMMA, or the user's GPU does not support WMMA (a minimal sketch of such a fallback follows this list).
  • Finally, some experimentation or a brute-force search to determine the optimal tiling parameters for this data type.
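
Regarding the generic fallback operator in the second bullet: a minimal, hedged sketch of what it could compute is a naive non-WMMA kernel in which each thread produces one element of C, working for any element type with * and +. The real operator would of course plug into GemmKernels' layout and tiling machinery:

using CUDA

# Naive GEMM fallback: no Tensor Cores, no tiling; each thread computes
# C[i, j] = alpha * sum(A[i, :] .* B[:, j]) + beta * C[i, j].
function generic_matmul!(C, A, B, alpha, beta)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    if i <= size(C, 1) && j <= size(C, 2)
        acc = zero(eltype(C))
        for k in 1:size(A, 2)
            @inbounds acc += A[i, k] * B[k, j]
        end
        @inbounds C[i, j] = alpha * acc + beta * C[i, j]
    end
    return
end

A = CUDA.CuArray(rand(Int32(-10):Int32(10), 128, 128))
B = CUDA.CuArray(rand(Int32(-10):Int32(10), 128, 128))
C = CUDA.zeros(Int32, 128, 128)
@cuda threads=(16, 16) blocks=(8, 8) generic_matmul!(C, A, B, Int32(1), Int32(0))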

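And regarding the vectorisation question in the first bullet, one way this can be emulated today is by widening the element type, so that each aligned access covers 16 bytes; a hedged sketch:

using CUDA

# Reinterpret four Float32s as a single 16-byte element; each load/store
# in the kernel then moves 128 bits at a time (assuming 16-byte-aligned
# base pointers, which fresh CuArray allocations provide).
function vcopy!(dst, src)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(dst)
        @inbounds dst[i] = src[i]
    end
    return
end

arr1 = CUDA.zeros(Float32, 512, 512)
arr2 = CUDA.rand(Float32, 512, 512)
wide1 = reinterpret(NTuple{4, Float32}, vec(arr1))
wide2 = reinterpret(NTuple{4, Float32}, vec(arr2))
@cuda threads=256 blocks=cld(length(wide1), 256) vcopy!(wide1, wide2)
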
@thomasfaingnaert linked a pull request on May 15, 2023 that will close this issue.