
mapreduce assumes commutative op #484

Closed · tkf opened this issue Oct 11, 2020 · 5 comments · Fixed by #500

Labels: bug (Something isn't working)

tkf commented Oct 11, 2020

Describe the bug

It looks like CUDA.jl's mapreduce assumes commutativity of op here:

CUDA.jl/src/mapreduce.jl, lines 67 to 73 in bc2d5f4:

while d > 0
    sync_threads()
    if thread <= d
        shared[thread] = op(shared[thread], shared[thread+d])
    end
    d >>= 1
end

and here

CUDA.jl/src/mapreduce.jl, lines 114 to 121 in bc2d5f4:

# reduce serially across chunks of input vector that don't fit in a block
ireduce = threadIdx_reduce + (blockIdx_reduce - 1) * blockDim_reduce
while ireduce <= length(Rreduce)
    Ireduce = Rreduce[ireduce]
    J = Base.max(Iother, Ireduce)
    val = op(val, f(_map_getindex(As, J)...))
    ireduce += blockDim_reduce * gridDim_reduce
end

To reproduce

As a result, the result of mapreduce can differ from the CPU version even if op is exactly associative (but not commutative). For example, we can't use mapreduce to implement a findmax that returns the first maximum element and its location:

julia> using CUDA

julia> naive_findmax(xs) = mapreduce(
           tuple,
           ((a, i), (b, j)) -> a < b ? (b, j) : (a, i),
           xs,
           eachindex(xs);
           init = (typemin(eltype(xs)), -1),
       )
naive_findmax (generic function with 1 method)

julia> xs = CUDA.ones(1000);

julia> xs[1] = 0
0

julia> naive_findmax(xs)
(1.0f0, 513)

julia> naive_findmax(Array(xs))
(1.0f0, 2)

julia> findmax(Array(xs))
(1.0f0, 2)

Reading the docstring of Base.reduce, my expectation has been that reduce(op, xs) and mapreduce(f, op, xs) require only associativity of op, not commutativity. It would be nice if the GPU mapreduce implementations also assumed only associativity. If that is difficult to implement, I think it would be reasonable to mention this in the docstring.
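
To make the ordering concrete, string concatenation is associative but not commutative, so it exposes the interleaved pairing (CPU-only illustration):

julia> reduce(*, ["a", "b", "c", "d"])  # Base folds left to right
"abcd"

julia> ("a" * "c") * ("b" * "d")        # the pairing produced by the tree reduction above
"acbd"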

Manifest.toml

# This file is machine-generated - editing it directly is not advised

[[AbstractFFTs]]
deps = ["LinearAlgebra"]
git-tree-sha1 = "051c95d6836228d120f5f4b984dd5aba1624f716"
uuid = "621f4979-c628-5d54-868e-fcf4e3e8185c"
version = "0.5.0"

[[Adapt]]
deps = ["LinearAlgebra"]
git-tree-sha1 = "42c42f2221906892ceb765dbcb1a51deeffd86d7"
uuid = "79e6a3ab-5dfb-504d-930d-738a2a938a0e"
version = "2.3.0"

[[BFloat16s]]
deps = ["LinearAlgebra", "Test"]
git-tree-sha1 = "4af69e205efc343068dc8722b8dfec1ade89254a"
uuid = "ab4f0b2a-ad5b-11e8-123f-65d77653426b"
version = "0.1.0"

[[Base64]]
uuid = "2a0f44e3-6c83-55bd-87e4-b1978d98bd5f"

[[BinaryProvider]]
deps = ["Libdl", "Logging", "SHA"]
git-tree-sha1 = "ecdec412a9abc8db54c0efc5548c64dfce072058"
uuid = "b99e7846-7c00-51b0-8f62-c81ae34c0232"
version = "0.5.10"

[[CEnum]]
git-tree-sha1 = "215a9aa4a1f23fbd05b92769fdd62559488d70e9"
uuid = "fa961155-64e5-5f13-b03f-caf6b980ea82"
version = "0.4.1"

[[CUDA]]
deps = ["AbstractFFTs", "Adapt", "BFloat16s", "BinaryProvider", "CEnum", "DataStructures", "ExprTools", "GPUArrays", "GPUCompiler", "LLVM", "Libdl", "LinearAlgebra", "Logging", "MacroTools", "NNlib", "Pkg", "Printf", "Random", "Reexport", "Requires", "SparseArrays", "Statistics", "TimerOutputs"]
git-tree-sha1 = "eb6d045a717d7318a411bd29e3eb3b44b6ac18fe"
uuid = "052768ef-5323-5732-b1bb-66c8b64840ba"
version = "2.0.1"

[[Compat]]
deps = ["Base64", "Dates", "DelimitedFiles", "Distributed", "InteractiveUtils", "LibGit2", "Libdl", "LinearAlgebra", "Markdown", "Mmap", "Pkg", "Printf", "REPL", "Random", "SHA", "Serialization", "SharedArrays", "Sockets", "SparseArrays", "Statistics", "Test", "UUIDs", "Unicode"]
git-tree-sha1 = "f76e41cf110de7176a657c72409e722cfc86fbb6"
uuid = "34da2185-b29b-5c13-b0c7-acf172513d20"
version = "3.20.0"

[[DataStructures]]
deps = ["Compat", "InteractiveUtils", "OrderedCollections"]
git-tree-sha1 = "db07bb22795762895b60e44d62b34b16c982a687"
uuid = "864edb3b-99cc-5e75-8d2d-829cb0a9cfe8"
version = "0.18.7"

[[Dates]]
deps = ["Printf"]
uuid = "ade2ca70-3891-5945-98fb-dc099432e06a"

[[DelimitedFiles]]
deps = ["Mmap"]
uuid = "8bb1440f-4735-579b-a4ab-409b98df4dab"

[[Distributed]]
deps = ["Random", "Serialization", "Sockets"]
uuid = "8ba89e20-285c-5b6f-9357-94700520ee1b"

[[ExprTools]]
git-tree-sha1 = "7fce513fcda766962ff67c5596cb16c463dfd371"
uuid = "e2ba6199-217a-4e67-a87a-7c52f15ade04"
version = "0.1.2"

[[GPUArrays]]
deps = ["AbstractFFTs", "Adapt", "LinearAlgebra", "Printf", "Random", "Serialization"]
git-tree-sha1 = "e39817aafb64a0794817a1e5126d042d0b26f700"
uuid = "0c68f7d7-f131-5f86-a1c3-88cf8149b2d7"
version = "6.0.1"

[[GPUCompiler]]
deps = ["DataStructures", "InteractiveUtils", "LLVM", "Libdl", "TimerOutputs", "UUIDs"]
git-tree-sha1 = "1b19d415fc3581ff0ed2f57875fca16b5190060a"
uuid = "61eb1bfa-7361-4325-ad38-22787b887f55"
version = "0.7.3"

[[InteractiveUtils]]
deps = ["Markdown"]
uuid = "b77e0a4c-d291-57a0-90e8-8db25a27a240"

[[LLVM]]
deps = ["CEnum", "Libdl", "Printf", "Unicode"]
git-tree-sha1 = "70070a0131f17fcffc5fc004f5f73f037bd217c5"
uuid = "929cbde3-209d-540e-8aea-75f648917ca0"
version = "3.2.0"

[[LibGit2]]
deps = ["Printf"]
uuid = "76f85450-5226-5b5a-8eaa-529ad045b433"

[[Libdl]]
uuid = "8f399da3-3557-5675-b5ff-fb832c97cbdb"

[[LinearAlgebra]]
deps = ["Libdl"]
uuid = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"

[[Logging]]
uuid = "56ddb016-857b-54e1-b83d-db4d58db5568"

[[MacroTools]]
deps = ["Markdown", "Random"]
git-tree-sha1 = "f7d2e3f654af75f01ec49be82c231c382214223a"
uuid = "1914dd2f-81c6-5fcd-8719-6d5c9610ff09"
version = "0.5.5"

[[Markdown]]
deps = ["Base64"]
uuid = "d6f4376e-aef5-505a-96c1-9c027394607a"

[[Mmap]]
uuid = "a63ad114-7e13-5084-954f-fe012c677804"

[[NNlib]]
deps = ["Libdl", "LinearAlgebra", "Pkg", "Requires", "Statistics"]
git-tree-sha1 = "1ef04283efe283be08e2d0de842f5e5286dd0b7a"
uuid = "872c559c-99b0-510c-b3b7-b6c96a88d5cd"
version = "0.7.5"

[[OrderedCollections]]
git-tree-sha1 = "16c08bf5dba06609fe45e30860092d6fa41fde7b"
uuid = "bac558e1-5e72-5ebc-8fee-abe8a469f55d"
version = "1.3.1"

[[Pkg]]
deps = ["Dates", "LibGit2", "Libdl", "Logging", "Markdown", "Printf", "REPL", "Random", "SHA", "UUIDs"]
uuid = "44cfe95a-1eb2-52ea-b672-e2afdf69b78f"

[[Printf]]
deps = ["Unicode"]
uuid = "de0858da-6303-5e67-8744-51eddeeeb8d7"

[[REPL]]
deps = ["InteractiveUtils", "Markdown", "Sockets"]
uuid = "3fa0cd96-eef1-5676-8a61-b3b8758bbffb"

[[Random]]
deps = ["Serialization"]
uuid = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"

[[Reexport]]
deps = ["Pkg"]
git-tree-sha1 = "7b1d07f411bc8ddb7977ec7f377b97b158514fe0"
uuid = "189a3867-3050-52da-a836-e630ba90ab69"
version = "0.2.0"

[[Requires]]
deps = ["UUIDs"]
git-tree-sha1 = "28faf1c963ca1dc3ec87f166d92982e3c4a1f66d"
uuid = "ae029012-a4dd-5104-9daa-d747884805df"
version = "1.1.0"

[[SHA]]
uuid = "ea8e919c-243c-51af-8825-aaa63cd721ce"

[[Serialization]]
uuid = "9e88b42a-f829-5b0c-bbe9-9e923198166b"

[[SharedArrays]]
deps = ["Distributed", "Mmap", "Random", "Serialization"]
uuid = "1a1011a3-84de-559e-8e89-a11a2f7dc383"

[[Sockets]]
uuid = "6462fe0b-24de-5631-8697-dd941f90decc"

[[SparseArrays]]
deps = ["LinearAlgebra", "Random"]
uuid = "2f01184e-e22b-5df5-ae63-d93ebab69eaf"

[[Statistics]]
deps = ["LinearAlgebra", "SparseArrays"]
uuid = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"

[[Test]]
deps = ["Distributed", "InteractiveUtils", "Logging", "Random"]
uuid = "8dfed614-e22c-5e08-85e1-65c5234f0b40"

[[TimerOutputs]]
deps = ["Printf"]
git-tree-sha1 = "f458ca23ff80e46a630922c555d838303e4b9603"
uuid = "a759f4b9-e2f1-59dc-863e-4aeb61b1ea8f"
version = "0.5.6"

[[UUIDs]]
deps = ["Random", "SHA"]
uuid = "cf7118a7-6976-5b1a-9a39-7adc72f591a4"

[[Unicode]]
uuid = "4ec0a83e-493e-50e2-b9ac-8f72acf5a8f5"

Expected behavior

I expect naive_findmax(xs::CuArray) == naive_findmax(Array(xs)).

Version info

Details on Julia:

julia> versioninfo()
Julia Version 1.5.2
Commit 539f3ce943 (2020-09-23 23:17 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, broadwell)

Details on CUDA:

julia> CUDA.versioninfo()
CUDA toolkit 11.0.3, artifact installation
CUDA driver 11.1.0
NVIDIA driver 455.23.5

Libraries: 
- CUBLAS: 11.2.0
- CURAND: 10.2.1
- CUFFT: 10.2.1
- CUSOLVER: 10.6.0
- CUSPARSE: 11.1.1
- CUPTI: 13.0.0
- NVML: 11.0.0+455.23.5
- CUDNN: 8.0.3 (for CUDA 11.0.0)
- CUTENSOR: 1.2.0 (for CUDA 11.0.0)

Toolchain:
- Julia: 1.5.2
- LLVM: 9.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4
- Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75

9 devices:
  0: Tesla V100-PCIE-32GB (sm_70, 30.991 GiB / 31.749 GiB available)
  1: Tesla V100-PCIE-32GB (sm_70, 31.678 GiB / 31.749 GiB available)
  2: Tesla V100-PCIE-32GB (sm_70, 31.702 GiB / 31.749 GiB available)
  3: Tesla V100-PCIE-32GB (sm_70, 29.066 GiB / 31.749 GiB available)
  4: Tesla V100-PCIE-16GB (sm_70, 15.778 GiB / 15.782 GiB available)
  5: Tesla P100-PCIE-16GB (sm_60, 15.897 GiB / 15.899 GiB available)
  6: Tesla P100-PCIE-16GB (sm_60, 15.897 GiB / 15.899 GiB available)
  7: GeForce GTX 1080 Ti (sm_61, 10.913 GiB / 10.917 GiB available)
  8: GeForce GTX 1080 Ti (sm_61, 10.913 GiB / 10.917 GiB available)

Additional context

I wonder if you can iterate using d <<= 1 instead of d >>= 1 to recover Base compatibility.

tkf added the bug label on Oct 11, 2020

tkf commented Oct 12, 2020

I wonder if you can iterate using d <<= 1 instead of d >>= 1 to recover Base compatibility.

I just implemented a Base-compat version of one-dimensional (no dims) reduce: https://github.com/JuliaFolds/CUDAFolds.jl/blob/cc02b3862ca7bc9baeb9dd71a1b9e459afe4e37b/src/kernels.jl#L167-L179

I suspect mapreducedim! can also be implemented in a similar manner, but my impression is that the current implementation in CUDA.jl prefers a different access pattern. It's the first GPU kernel I've written, so I'm not super confident, though.
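
For reference, here is a minimal CPU sketch of the ordering-preserving pattern I mean (my own illustration with made-up names, not the CUDAFolds.jl kernel): each chunk is contiguous and folded left to right, and the chunk results are combined in order, so only associativity is needed:

# Ordering-preserving block reduction (serial sketch; assumes xs is nonempty).
function ordered_block_reduce(op, f, xs, nchunks)
    chunklen = cld(length(xs), nchunks)
    chunks = Iterators.partition(eachindex(xs), chunklen)
    partials = [reduce(op, (f(xs[i]) for i in c)) for c in chunks]  # left-to-right within each chunk
    return reduce(op, partials)                                     # chunk results combined in order
end

This keeps Base's left-to-right ordering at the cost of contiguous rather than strided accesses per thread, which is the access-pattern trade-off mentioned above.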

maleadt commented Oct 20, 2020

How can we fix this without essentially serializing the reduction? When reducing a 1D vector in parallel, aren't we always going to be reordering operations? Changing the order to something <<=-based helps for findmax, but not for the general case (although it might be a good change nonetheless, since it might be easier to coalesce):

julia> reduce(-,CuArray([1,2,3,4]))
>>=
Round 1
Thread 1: reducing element 1 and 3
Thread 2: reducing element 2 and 4
Round 2
Thread 1: reducing element 1 and 2
0

julia> reduce(-,CuArray([1,2,3,4]))
<<=
Round 1
Thread 1: reducing element 1 and 2
Thread 2: reducing element 3 and 4
Round 2
Thread 1: reducing element 1 and 3
1

julia> reduce(-,[1,2,3,4])
-8

EDIT: whoops, I'm a dummy; - is obviously not associative.

Ellipse0934 commented Oct 21, 2020

Interestingly, the naive_findmax here is faster than the current findmax.

julia> A = CUDA.rand(1 << 29);

julia> @benchmark CUDA.@sync findmax(A)
BenchmarkTools.Trial: 
  memory estimate:  4.13 KiB
  allocs estimate:  149
  --------------
  minimum time:     34.977 ms (0.00% GC)
  median time:      35.076 ms (0.00% GC)
  mean time:        35.108 ms (0.00% GC)
  maximum time:     35.557 ms (0.00% GC)
  --------------
  samples:          143
  evals/sample:     1

julia> @benchmark CUDA.@sync naive_findmax(A)
BenchmarkTools.Trial: 
  memory estimate:  2.13 KiB
  allocs estimate:  85
  --------------
  minimum time:     17.509 ms (0.00% GC)
  median time:      17.607 ms (0.00% GC)
  mean time:        17.617 ms (0.00% GC)
  maximum time:     17.893 ms (0.00% GC)
  --------------
  samples:          284
  evals/sample:     1

julia> CUDA.@time findmax(A)
  0.038558 seconds (158 CPU allocations: 5.297 KiB) (3 GPU allocations: 92 bytes, 0.02% gc time)
(1.0f0, 35222192)

julia> CUDA.@time naive_findmax(A)
  0.017996 seconds (85 CPU allocations: 2.125 KiB) (2 GPU allocations: 336 bytes, 0.02% gc time)
(1.0f0, 275943538)

Which makes sense, as the current findmax first computes the maximum and then does a findfirstval, as outlined in #320.

Can this be considered a replacement pending the associativity fix? Even modifying the expression to ((a, i), (b, j)) -> a < b ? (b, j) : (a == b ? (a, min(i, j)) : (a, i)) has the same performance.

maleadt commented Oct 21, 2020

Ha, that's funny. Yes, it's probably worth doing so while awaiting a better implementation.

tkf commented Oct 21, 2020

Can this be considered a replacement

We need to use isless to get correct behavior for NaN (and for missing, once CuArray supports union eltypes).
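
Concretely, something along these lines (my sketch; the name is made up and it is untested on GPU):

# isless orders NaN after all other values (and missing after everything),
# so the behavior matches Base.findmax; isequal plus min(i, j) keeps the first index on ties.
findmax_op((a, i), (b, j)) = isless(a, b) ? (b, j) : (isequal(a, b) ? (a, min(i, j)) : (a, i))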
