Errors on small array inputs #52

Closed
GiggleLiu opened this issue Nov 4, 2020 · 1 comment · Fixed by #130

Comments

@GiggleLiu

The BLAS.gemmEx! function errors on array sizes < 128. I want to experiment with this functionality on small array sizes like 16×16.
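Something like the following triggers it (a minimal sketch; the transpose flags and alpha/beta values are just an example, the argument types match the stack trace below):

    using CUDA, GemmKernels

    A = CuArray(rand(Float16, 16, 16))
    B = CuArray(rand(Float16, 16, 16))
    C = CuArray(zeros(Float32, 16, 16))

    # 16 is far below the (M = 128, N = 128, K = 64) threadblock tile, so the
    # computed grid presumably ends up with zero blocks and the launch fails with
    # "ArgumentError: Grid dimensions should be non-null".
    GemmKernels.BLAS.gemmEx!('N', 'N', 1, A, B, 0, C)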

Error message

ERROR: LoadError: ArgumentError: Grid dimensions should be non-null
Stacktrace:
 [1] launch(::CuFunction, ::CuDeviceArray{Float16,2,1}, ::CuDeviceArray{Float16,2,1}, ::CuDeviceArray{Float32,2,1}, ::CuDeviceArray{Float32,2,1}, ::GemmKernels.Transform.Elementwise{GemmKernels.BLAS.var"#1#3"{Int64,Int64}}, ::GemmKernels.Transform.Elementwise{GemmKernels.BLAS.var"#2#4"{Int64}}; blocks::Tuple{Int64,Int64}, threads::Int64, cooperative::Bool, shmem::Int64, stream::CuStream) at /home/leo/.julia/dev/CUDA/lib/cudadrv/execution.jl:57
 [2] #599 at /home/leo/.julia/dev/CUDA/lib/cudadrv/execution.jl:138 [inlined]
 [3] macro expansion at /home/leo/.julia/dev/CUDA/lib/cudadrv/execution.jl:97 [inlined]
 [4] convert_arguments at /home/leo/.julia/dev/CUDA/lib/cudadrv/execution.jl:79 [inlined]
 [5] #cudacall#598 at /home/leo/.julia/dev/CUDA/lib/cudadrv/execution.jl:137 [inlined]
 [6] #cudacall#790 at /home/leo/.julia/dev/CUDA/src/compiler/execution.jl:219 [inlined]
 [7] macro expansion at /home/leo/.julia/dev/CUDA/src/compiler/execution.jl:200 [inlined]
 [8] call(::CUDA.HostKernel{GemmKernels.Kernel.matmul_pipelined,Tuple{CuDeviceArray{Float16,2,1},CuDeviceArray{Float16,2,1},CuDeviceArray{Float32,2,1},CuDeviceArray{Float32,2,1},GemmKernels.Transform.Elementwise{typeof(identity)},GemmKernels.Transform.Elementwise{typeof(identity)},GemmKernels.Transform.Elementwise{typeof(identity)},GemmKernels.Transform.Elementwise{typeof(identity)},GemmKernels.Transform.Elementwise{typeof(identity)},GemmKernels.Transform.Elementwise{typeof(identity)},GemmKernels.Transform.Elementwise{GemmKernels.BLAS.var"#1#3"{Int64,Int64}},GemmKernels.Transform.Elementwise{GemmKernels.BLAS.var"#2#4"{Int64}},GemmKernels.Epilogue.Default,Type{GemmKernels.Config{(M = 64, N = 64, K = 64),(M = 128, N = 128, K = 64),8,(M = 128, K = 2),(M = 8, K = 1),(K = 64, N = 4),(K = 8, N = 1),(M = 128, N = 1),(M = 4, N = 1),(M = 32, N = 64, K = 16),(M = 16, N = 16, K = 16),GemmKernels.Layout.AlignedColMajor{Float16},GemmKernels.Layout.AlignedColMajor{Float16},GemmKernels.Layout.AlignedColMajor{Float32},GemmKernels.Layout.AlignedColMajor{Float32},GemmKernels.Layout.Padded{GemmKernels.Layout.AlignedColMajor{Float16},8},GemmKernels.Layout.Padded{GemmKernels.Layout.AlignedColMajor{Float16},8},GemmKernels.Layout.AlignedColMajor{Float32},GemmKernels.Layout.AlignedColMajor{Float32},WMMAOp{16,16,16},true,true}}}}, ::CuDeviceArray{Float16,2,1}, ::CuDeviceArray{Float16,2,1}, ::CuDeviceArray{Float32,2,1}, ::CuDeviceArray{Float32,2,1}, ::GemmKernels.Transform.Elementwise{typeof(identity)}, ::GemmKernels.Transform.Elementwise{typeof(identity)}, ::GemmKernels.Transform.Elementwise{typeof(identity)}, ::GemmKernels.Transform.Elementwise{typeof(identity)}, ::GemmKernels.Transform.Elementwise{typeof(identity)}, ::GemmKernels.Transform.Elementwise{typeof(identity)}, ::GemmKernels.Transform.Elementwise{GemmKernels.BLAS.var"#1#3"{Int64,Int64}}, ::GemmKernels.Transform.Elementwise{GemmKernels.BLAS.var"#2#4"{Int64}}, ::GemmKernels.Epilogue.Default, ::Type{GemmKernels.Config{(M = 64, N = 64, K = 64),(M = 128, N = 128, K = 64),8,(M = 128, K = 2),(M = 8, K = 1),(K = 64, N = 4),(K = 8, N = 1),(M = 128, N = 1),(M = 4, N = 1),(M = 32, N = 64, K = 16),(M = 16, N = 16, K = 16),GemmKernels.Layout.AlignedColMajor{Float16},GemmKernels.Layout.AlignedColMajor{Float16},GemmKernels.Layout.AlignedColMajor{Float32},GemmKernels.Layout.AlignedColMajor{Float32},GemmKernels.Layout.Padded{GemmKernels.Layout.AlignedColMajor{Float16},8},GemmKernels.Layout.Padded{GemmKernels.Layout.AlignedColMajor{Float16},8},GemmKernels.Layout.AlignedColMajor{Float32},GemmKernels.Layout.AlignedColMajor{Float32},WMMAOp{16,16,16},true,true}}; call_kwargs::Base.Iterators.Pairs{Symbol,Any,Tuple{Symbol,Symbol,Symbol},NamedTuple{(:threads, :blocks, :shmem),Tuple{Int64,Tuple{Int64,Int64},Int64}}}) at /home/leo/.julia/dev/CUDA/src/compiler/execution.jl:171
 [9] (::CUDA.HostKernel{GemmKernels.Kernel.matmul_pipelined,Tuple{CuDeviceArray{Float16,2,1},CuDeviceArray{Float16,2,1},CuDeviceArray{Float32,2,1},CuDeviceArray{Float32,2,1},GemmKernels.Transform.Elementwise{typeof(identity)},GemmKernels.Transform.Elementwise{typeof(identity)},GemmKernels.Transform.Elementwise{typeof(identity)},GemmKernels.Transform.Elementwise{typeof(identity)},GemmKernels.Transform.Elementwise{typeof(identity)},GemmKernels.Transform.Elementwise{typeof(identity)},GemmKernels.Transform.Elementwise{GemmKernels.BLAS.var"#1#3"{Int64,Int64}},GemmKernels.Transform.Elementwise{GemmKernels.BLAS.var"#2#4"{Int64}},GemmKernels.Epilogue.Default,Type{GemmKernels.Config{(M = 64, N = 64, K = 64),(M = 128, N = 128, K = 64),8,(M = 128, K = 2),(M = 8, K = 1),(K = 64, N = 4),(K = 8, N = 1),(M = 128, N = 1),(M = 4, N = 1),(M = 32, N = 64, K = 16),(M = 16, N = 16, K = 16),GemmKernels.Layout.AlignedColMajor{Float16},GemmKernels.Layout.AlignedColMajor{Float16},GemmKernels.Layout.AlignedColMajor{Float32},GemmKernels.Layout.AlignedColMajor{Float32},GemmKernels.Layout.Padded{GemmKernels.Layout.AlignedColMajor{Float16},8},GemmKernels.Layout.Padded{GemmKernels.Layout.AlignedColMajor{Float16},8},GemmKernels.Layout.AlignedColMajor{Float32},GemmKernels.Layout.AlignedColMajor{Float32},WMMAOp{16,16,16},true,true}}}})(::CuDeviceArray{Float16,2,1}, ::Vararg{Any,N} where N; kwargs::Base.Iterators.Pairs{Symbol,Any,Tuple{Symbol,Symbol,Symbol},NamedTuple{(:threads, :blocks, :shmem),Tuple{Int64,Tuple{Int64,Int64},Int64}}}) at /home/leo/.julia/dev/CUDA/src/compiler/execution.jl:353
 [10] matmul(::CuArray{Float16,2}, ::CuArray{Float16,2}, ::CuArray{Float32,2}, ::CuArray{Float32,2}, ::Type{T} where T; transform_global_to_shared_a::GemmKernels.Transform.Elementwise{typeof(identity)}, transform_global_to_shared_b::GemmKernels.Transform.Elementwise{typeof(identity)}, transform_global_to_shared_c::GemmKernels.Transform.Elementwise{typeof(identity)}, transform_shared_to_global_d::GemmKernels.Transform.Elementwise{typeof(identity)}, transform_shared_to_regs_a::GemmKernels.Transform.Elementwise{typeof(identity)}, transform_shared_to_regs_b::GemmKernels.Transform.Elementwise{typeof(identity)}, transform_shared_to_regs_c::GemmKernels.Transform.Elementwise{GemmKernels.BLAS.var"#1#3"{Int64,Int64}}, transform_regs_to_shared_d::GemmKernels.Transform.Elementwise{GemmKernels.BLAS.var"#2#4"{Int64}}, epilogue::GemmKernels.Epilogue.Default, kernel::typeof(GemmKernels.Kernel.matmul_pipelined)) at /home/leo/.julia/dev/GemmKernels/src/launch.jl:26
 [11] gemmEx!(::Char, ::Char, ::Int64, ::CuArray{Float16,2}, ::CuArray{Float16,2}, ::Int64, ::CuArray{Float32,2}) at /home/leo/.julia/dev/GemmKernels/src/blas.jl:64
 [12] top-level scope at /home/leo/.julia/dev/GemmKernels/matmul.jl:42
 [13] include_string(::Function, ::Module, ::String, ::String) at ./loading.jl:1091
 [14] invokelatest(::Any, ::Any, ::Vararg{Any,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at ./essentials.jl:710
 [15] invokelatest(::Any, ::Any, ::Vararg{Any,N} where N) at ./essentials.jl:709
 [16] inlineeval(::Module, ::String, ::Int64, ::Int64, ::String; softscope::Bool) at /home/leo/.vscode/extensions/julialang.language-julia-1.0.8/scripts/packages/VSCodeServer/src/eval.jl:83
 [17] (::VSCodeServer.var"#43#45"{VSCodeServer.ReplRunCodeRequestParams,String,Int64,Int64,String,Module,Bool})() at /home/leo/.vscode/extensions/julialang.language-julia-1.0.8/scripts/packages/VSCodeServer/src/eval.jl:45
 [18] withpath(::VSCodeServer.var"#43#45"{VSCodeServer.ReplRunCodeRequestParams,String,Int64,Int64,String,Module,Bool}, ::String) at /home/leo/.vscode/extensions/julialang.language-julia-1.0.8/scripts/packages/VSCodeServer/src/repl.jl:118
 [19] (::VSCodeServer.var"#42#44"{VSCodeServer.ReplRunCodeRequestParams,String,Int64,Int64,String,Module,Bool,Bool})() at /home/leo/.vscode/extensions/julialang.language-julia-1.0.8/scripts/packages/VSCodeServer/src/eval.jl:43
 [20] hideprompt(::VSCodeServer.var"#42#44"{VSCodeServer.ReplRunCodeRequestParams,String,Int64,Int64,String,Module,Bool,Bool}) at /home/leo/.vscode/extensions/julialang.language-julia-1.0.8/scripts/packages/VSCodeServer/src/repl.jl:36
 [21] repl_runcode_request(::VSCodeServer.JSONRPC.JSONRPCEndpoint{Base.PipeEndpoint,Base.PipeEndpoint}, ::VSCodeServer.ReplRunCodeRequestParams) at /home/leo/.vscode/extensions/julialang.language-julia-1.0.8/scripts/packages/VSCodeServer/src/eval.jl:23
 [22] dispatch_msg(::VSCodeServer.JSONRPC.JSONRPCEndpoint{Base.PipeEndpoint,Base.PipeEndpoint}, ::VSCodeServer.JSONRPC.MsgDispatcher, ::Dict{String,Any}) at /home/leo/.vscode/extensions/julialang.language-julia-1.0.8/scripts/packages/JSONRPC/src/typed.jl:66
 [23] macro expansion at /home/leo/.vscode/extensions/julialang.language-julia-1.0.8/scripts/packages/VSCodeServer/src/VSCodeServer.jl:95 [inlined]
 [24] (::VSCodeServer.var"#61#63"{Bool,String})() at ./task.jl:356
in expression starting at /home/leo/.julia/dev/GemmKernels/matmul.jl:42

Also, the computed value is not correct if the array size is not an exponent of

@thomasfaingnaert
Member

This is a limitation of the current implementation: only arrays whose size is a multiple of the threadblock size (e.g. (M = 128, N = 128, K = 64) for WMMA mixed-precision) are supported at the moment.
One way to support arbitrary matrix dimensions would be to predicate the loads from global memory to only access elements inside the bounds of the global matrix.
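A rough sketch of that predication idea, in plain Julia rather than the actual GemmKernels layout/kernel code (the function name and signature here are hypothetical):

    function load_tile_predicated!(tile, A, row0, col0, M, K)
        # Copy a tile starting at offset (row0, col0) from the global M x K matrix A.
        # Each load is guarded by a bounds check; out-of-bounds elements read as zero,
        # so the tile size no longer has to divide the matrix dimensions evenly.
        for j in axes(tile, 2), i in axes(tile, 1)
            gi, gj = row0 + i, col0 + j
            tile[i, j] = (gi <= M && gj <= K) ? A[gi, gj] : zero(eltype(tile))
        end
        return tile
    end

Stores of the result tile back to global memory would need a similar guard so that only in-bounds elements are written.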
