
mapreduce assumes commutative op #484

Closed · tkf opened this issue Oct 11, 2020 · 5 comments · Fixed by #500

Labels: bug (Something isn't working)

tkf commented Oct 11, 2020

Describe the bug

It looks like CUDA.jl's mapreduce assumes commutativity of op here:

CUDA.jl/src/mapreduce.jl, lines 67 to 73 in bc2d5f4:

while d > 0
    sync_threads()
    if thread <= d
        shared[thread] = op(shared[thread], shared[thread+d])
    end
    d >>= 1
end

and here

CUDA.jl/src/mapreduce.jl, lines 114 to 121 in bc2d5f4:

# reduce serially across chunks of input vector that don't fit in a block
ireduce = threadIdx_reduce + (blockIdx_reduce - 1) * blockDim_reduce
while ireduce <= length(Rreduce)
    Ireduce = Rreduce[ireduce]
    J = Base.max(Iother, Ireduce)
    val = op(val, f(_map_getindex(As, J)...))
    ireduce += blockDim_reduce * gridDim_reduce
end

To reproduce

As a result, the result of mapreduce can differ from the CPU version even if op is exactly associative (but not commutative). For example, we can't use mapreduce to implement a findmax that returns the first maximum element and its location:

julia> using CUDA

julia> naive_findmax(xs) = mapreduce(
           tuple,
           ((a, i), (b, j)) -> a < b ? (b, j) : (a, i),
           xs,
           eachindex(xs);
           init = (typemin(eltype(xs)), -1),
       )
naive_findmax (generic function with 1 method)

julia> xs = CUDA.ones(1000);

julia> xs[1] = 0
0

julia> naive_findmax(xs)
(1.0f0, 513)

julia> naive_findmax(Array(xs))
(1.0f0, 2)

julia> findmax(Array(xs))
(1.0f0, 2)

Reading the docstring of Base.reduce, my expectation has been that reduce(op, xs) and mapreduce(f, op, xs) require only associativity of op, not commutativity. It would be nice if the GPU mapreduce implementations also assumed only associativity. If that is difficult to implement, I think it would be reasonable to mention this in the docstring.
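
To make the ordering concrete, string concatenation is associative but not commutative, so it exposes the interleaved pairing (CPU-only illustration):

julia> reduce(*, ["a", "b", "c", "d"])  # Base folds left to right
"abcd"

julia> ("a" * "c") * ("b" * "d")        # the pairing produced by the tree reduction above
"acbd"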

Manifest.toml

# This file is machine-generated - editing it directly is not advised

[[AbstractFFTs]]
deps = ["LinearAlgebra"]
git-tree-sha1 = "051c95d6836228d120f5f4b984dd5aba1624f716"
uuid = "621f4979-c628-5d54-868e-fcf4e3e8185c"
version = "0.5.0"

[[Adapt]]
deps = ["LinearAlgebra"]
git-tree-sha1 = "42c42f2221906892ceb765dbcb1a51deeffd86d7"
uuid = "79e6a3ab-5dfb-504d-930d-738a2a938a0e"
version = "2.3.0"

[[BFloat16s]]
deps = ["LinearAlgebra", "Test"]
git-tree-sha1 = "4af69e205efc343068dc8722b8dfec1ade89254a"
uuid = "ab4f0b2a-ad5b-11e8-123f-65d77653426b"
version = "0.1.0"

[[Base64]]
uuid = "2a0f44e3-6c83-55bd-87e4-b1978d98bd5f"

[[BinaryProvider]]
deps = ["Libdl", "Logging", "SHA"]
git-tree-sha1 = "ecdec412a9abc8db54c0efc5548c64dfce072058"
uuid = "b99e7846-7c00-51b0-8f62-c81ae34c0232"
version = "0.5.10"

[[CEnum]]
git-tree-sha1 = "215a9aa4a1f23fbd05b92769fdd62559488d70e9"
uuid = "fa961155-64e5-5f13-b03f-caf6b980ea82"
version = "0.4.1"

[[CUDA]]
deps = ["AbstractFFTs", "Adapt", "BFloat16s", "BinaryProvider", "CEnum", "DataStructures", "ExprTools", "GPUArrays", "GPUCompiler", "LLVM", "Libdl", "LinearAlgebra", "Logging", "MacroTools", "NNlib", "Pkg", "Printf", "Random", "Reexport", "Requires", "SparseArrays", "Statistics", "TimerOutputs"]
git-tree-sha1 = "eb6d045a717d7318a411bd29e3eb3b44b6ac18fe"
uuid = "052768ef-5323-5732-b1bb-66c8b64840ba"
version = "2.0.1"

[[Compat]]
deps = ["Base64", "Dates", "DelimitedFiles", "Distributed", "InteractiveUtils", "LibGit2", "Libdl", "LinearAlgebra", "Markdown", "Mmap", "Pkg", "Printf", "REPL", "Random", "SHA", "Serialization", "SharedArrays", "Sockets", "SparseArrays", "Statistics", "Test", "UUIDs", "Unicode"]
git-tree-sha1 = "f76e41cf110de7176a657c72409e722cfc86fbb6"
uuid = "34da2185-b29b-5c13-b0c7-acf172513d20"
version = "3.20.0"

[[DataStructures]]
deps = ["Compat", "InteractiveUtils", "OrderedCollections"]
git-tree-sha1 = "db07bb22795762895b60e44d62b34b16c982a687"
uuid = "864edb3b-99cc-5e75-8d2d-829cb0a9cfe8"
version = "0.18.7"

[[Dates]]
deps = ["Printf"]
uuid = "ade2ca70-3891-5945-98fb-dc099432e06a"

[[DelimitedFiles]]
deps = ["Mmap"]
uuid = "8bb1440f-4735-579b-a4ab-409b98df4dab"

[[Distributed]]
deps = ["Random", "Serialization", "Sockets"]
uuid = "8ba89e20-285c-5b6f-9357-94700520ee1b"

[[ExprTools]]
git-tree-sha1 = "7fce513fcda766962ff67c5596cb16c463dfd371"
uuid = "e2ba6199-217a-4e67-a87a-7c52f15ade04"
version = "0.1.2"

[[GPUArrays]]
deps = ["AbstractFFTs", "Adapt", "LinearAlgebra", "Printf", "Random", "Serialization"]
git-tree-sha1 = "e39817aafb64a0794817a1e5126d042d0b26f700"
uuid = "0c68f7d7-f131-5f86-a1c3-88cf8149b2d7"
version = "6.0.1"

[[GPUCompiler]]
deps = ["DataStructures", "InteractiveUtils", "LLVM", "Libdl", "TimerOutputs", "UUIDs"]
git-tree-sha1 = "1b19d415fc3581ff0ed2f57875fca16b5190060a"
uuid = "61eb1bfa-7361-4325-ad38-22787b887f55"
version = "0.7.3"

[[InteractiveUtils]]
deps = ["Markdown"]
uuid = "b77e0a4c-d291-57a0-90e8-8db25a27a240"

[[LLVM]]
deps = ["CEnum", "Libdl", "Printf", "Unicode"]
git-tree-sha1 = "70070a0131f17fcffc5fc004f5f73f037bd217c5"
uuid = "929cbde3-209d-540e-8aea-75f648917ca0"
version = "3.2.0"

[[LibGit2]]
deps = ["Printf"]
uuid = "76f85450-5226-5b5a-8eaa-529ad045b433"

[[Libdl]]
uuid = "8f399da3-3557-5675-b5ff-fb832c97cbdb"

[[LinearAlgebra]]
deps = ["Libdl"]
uuid = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"

[[Logging]]
uuid = "56ddb016-857b-54e1-b83d-db4d58db5568"

[[MacroTools]]
deps = ["Markdown", "Random"]
git-tree-sha1 = "f7d2e3f654af75f01ec49be82c231c382214223a"
uuid = "1914dd2f-81c6-5fcd-8719-6d5c9610ff09"
version = "0.5.5"

[[Markdown]]
deps = ["Base64"]
uuid = "d6f4376e-aef5-505a-96c1-9c027394607a"

[[Mmap]]
uuid = "a63ad114-7e13-5084-954f-fe012c677804"

[[NNlib]]
deps = ["Libdl", "LinearAlgebra", "Pkg", "Requires", "Statistics"]
git-tree-sha1 = "1ef04283efe283be08e2d0de842f5e5286dd0b7a"
uuid = "872c559c-99b0-510c-b3b7-b6c96a88d5cd"
version = "0.7.5"

[[OrderedCollections]]
git-tree-sha1 = "16c08bf5dba06609fe45e30860092d6fa41fde7b"
uuid = "bac558e1-5e72-5ebc-8fee-abe8a469f55d"
version = "1.3.1"

[[Pkg]]
deps = ["Dates", "LibGit2", "Libdl", "Logging", "Markdown", "Printf", "REPL", "Random", "SHA", "UUIDs"]
uuid = "44cfe95a-1eb2-52ea-b672-e2afdf69b78f"

[[Printf]]
deps = ["Unicode"]
uuid = "de0858da-6303-5e67-8744-51eddeeeb8d7"

[[REPL]]
deps = ["InteractiveUtils", "Markdown", "Sockets"]
uuid = "3fa0cd96-eef1-5676-8a61-b3b8758bbffb"

[[Random]]
deps = ["Serialization"]
uuid = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"

[[Reexport]]
deps = ["Pkg"]
git-tree-sha1 = "7b1d07f411bc8ddb7977ec7f377b97b158514fe0"
uuid = "189a3867-3050-52da-a836-e630ba90ab69"
version = "0.2.0"

[[Requires]]
deps = ["UUIDs"]
git-tree-sha1 = "28faf1c963ca1dc3ec87f166d92982e3c4a1f66d"
uuid = "ae029012-a4dd-5104-9daa-d747884805df"
version = "1.1.0"

[[SHA]]
uuid = "ea8e919c-243c-51af-8825-aaa63cd721ce"

[[Serialization]]
uuid = "9e88b42a-f829-5b0c-bbe9-9e923198166b"

[[SharedArrays]]
deps = ["Distributed", "Mmap", "Random", "Serialization"]
uuid = "1a1011a3-84de-559e-8e89-a11a2f7dc383"

[[Sockets]]
uuid = "6462fe0b-24de-5631-8697-dd941f90decc"

[[SparseArrays]]
deps = ["LinearAlgebra", "Random"]
uuid = "2f01184e-e22b-5df5-ae63-d93ebab69eaf"

[[Statistics]]
deps = ["LinearAlgebra", "SparseArrays"]
uuid = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"

[[Test]]
deps = ["Distributed", "InteractiveUtils", "Logging", "Random"]
uuid = "8dfed614-e22c-5e08-85e1-65c5234f0b40"

[[TimerOutputs]]
deps = ["Printf"]
git-tree-sha1 = "f458ca23ff80e46a630922c555d838303e4b9603"
uuid = "a759f4b9-e2f1-59dc-863e-4aeb61b1ea8f"
version = "0.5.6"

[[UUIDs]]
deps = ["Random", "SHA"]
uuid = "cf7118a7-6976-5b1a-9a39-7adc72f591a4"

[[Unicode]]
uuid = "4ec0a83e-493e-50e2-b9ac-8f72acf5a8f5"

Expected behavior

I expect naive_findmax(xs::CuArray) == naive_findmax(Array(xs)).

Version info

Details on Julia:

julia> versioninfo()
Julia Version 1.5.2
Commit 539f3ce943 (2020-09-23 23:17 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, broadwell)

Details on CUDA:

julia> CUDA.versioninfo()
CUDA toolkit 11.0.3, artifact installation
CUDA driver 11.1.0
NVIDIA driver 455.23.5

Libraries: 
- CUBLAS: 11.2.0
- CURAND: 10.2.1
- CUFFT: 10.2.1
- CUSOLVER: 10.6.0
- CUSPARSE: 11.1.1
- CUPTI: 13.0.0
- NVML: 11.0.0+455.23.5
- CUDNN: 8.0.3 (for CUDA 11.0.0)
- CUTENSOR: 1.2.0 (for CUDA 11.0.0)

Toolchain:
- Julia: 1.5.2
- LLVM: 9.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4
- Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75

9 devices:
  0: Tesla V100-PCIE-32GB (sm_70, 30.991 GiB / 31.749 GiB available)
  1: Tesla V100-PCIE-32GB (sm_70, 31.678 GiB / 31.749 GiB available)
  2: Tesla V100-PCIE-32GB (sm_70, 31.702 GiB / 31.749 GiB available)
  3: Tesla V100-PCIE-32GB (sm_70, 29.066 GiB / 31.749 GiB available)
  4: Tesla V100-PCIE-16GB (sm_70, 15.778 GiB / 15.782 GiB available)
  5: Tesla P100-PCIE-16GB (sm_60, 15.897 GiB / 15.899 GiB available)
  6: Tesla P100-PCIE-16GB (sm_60, 15.897 GiB / 15.899 GiB available)
  7: GeForce GTX 1080 Ti (sm_61, 10.913 GiB / 10.917 GiB available)
  8: GeForce GTX 1080 Ti (sm_61, 10.913 GiB / 10.917 GiB available)

Additional context

I wonder if you can iterate using d <<= 1 instead of d >>= 1 to recover Base compatibility.

tkf added the bug label on Oct 11, 2020

tkf commented Oct 12, 2020

I wonder if you can iterate using d <<= 1 instead of d >>= 1 to recover Base compatibility.

I just implemented a Base-compat version of one-dimensional (no dims) reduce: https://github.com/JuliaFolds/CUDAFolds.jl/blob/cc02b3862ca7bc9baeb9dd71a1b9e459afe4e37b/src/kernels.jl#L167-L179

I suspect mapreducedim! can also be implemented in a similar manner, but my impression is that the current implementation in CUDA.jl prefers a different access pattern. It's the first GPU kernel I've written, so I'm not super confident, though.
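
For reference, here is a minimal CPU sketch of the ordering-preserving pattern I mean (my own illustration with made-up names, not the CUDAFolds.jl kernel): each chunk is contiguous and folded left to right, and the chunk results are combined in order, so only associativity is needed:

# Ordering-preserving block reduction (serial sketch; assumes xs is nonempty).
function ordered_block_reduce(op, f, xs, nchunks)
    chunklen = cld(length(xs), nchunks)
    chunks = Iterators.partition(eachindex(xs), chunklen)
    partials = [reduce(op, (f(xs[i]) for i in c)) for c in chunks]  # left-to-right within each chunk
    return reduce(op, partials)                                     # chunk results combined in order
end

This keeps Base's left-to-right ordering at the cost of contiguous rather than strided accesses per thread, which is the access-pattern trade-off mentioned above.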

maleadt commented Oct 20, 2020

How can we fix this without essentially serializing the reduction? When reducing a 1D vector in parallel, aren't we always going to be reordering operations? Changing the order to something <<=-based helps for findmax, but not for the general case (although it might be a good change nonetheless, since it might be easier to coalesce):

julia> reduce(-,CuArray([1,2,3,4]))
>>=
Round 1
Thread 1: reducing element 1 and 3
Thread 2: reducing element 2 and 4
Round 2
Thread 1: reducing element 1 and 2
0

julia> reduce(-,CuArray([1,2,3,4]))
<<=
Round 1
Thread 1: reducing element 1 and 2
Thread 2: reducing element 3 and 4
Round 2
Thread 1: reducing element 1 and 3
1

julia> reduce(-,[1,2,3,4])
-8

EDIT: whoops, I'm a dummy; - is obviously not associative.

Ellipse0934 commented Oct 21, 2020

Interestingly, the naive_findmax here is faster than the current findmax.

julia> A = CUDA.rand(1 << 29);

julia> @benchmark CUDA.@sync findmax(A)
BenchmarkTools.Trial: 
  memory estimate:  4.13 KiB
  allocs estimate:  149
  --------------
  minimum time:     34.977 ms (0.00% GC)
  median time:      35.076 ms (0.00% GC)
  mean time:        35.108 ms (0.00% GC)
  maximum time:     35.557 ms (0.00% GC)
  --------------
  samples:          143
  evals/sample:     1

julia> @benchmark CUDA.@sync naive_findmax(A)
BenchmarkTools.Trial: 
  memory estimate:  2.13 KiB
  allocs estimate:  85
  --------------
  minimum time:     17.509 ms (0.00% GC)
  median time:      17.607 ms (0.00% GC)
  mean time:        17.617 ms (0.00% GC)
  maximum time:     17.893 ms (0.00% GC)
  --------------
  samples:          284
  evals/sample:     1

julia> CUDA.@time findmax(A)
  0.038558 seconds (158 CPU allocations: 5.297 KiB) (3 GPU allocations: 92 bytes, 0.02% gc time)
(1.0f0, 35222192)

julia> CUDA.@time naive_findmax(A)
  0.017996 seconds (85 CPU allocations: 2.125 KiB) (2 GPU allocations: 336 bytes, 0.02% gc time)
(1.0f0, 275943538)

Which makes sense, as the current findmax first computes the maximum and then does a findfirstval, as outlined in #320.

Can this be considered a replacement pending the associativity fix? Even modifying the expression to ((a, i), (b, j)) -> a < b ? (b, j) : (a == b ? (a, min(i, j)) : (a, i)) has the same performance.

maleadt commented Oct 21, 2020

Ha, that's funny. Yes, it's probably worth doing so while awaiting a better implementation.

tkf commented Oct 21, 2020

Can this be considered a replacement

We need to use isless to get correct behavior for NaN (and for missing, once CuArray supports union eltypes).
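
Concretely, something along these lines (my sketch; the name is made up and it is untested on GPU):

# isless orders NaN after all other values (and missing after everything),
# so the behavior matches Base.findmax; isequal plus min(i, j) keeps the first index on ties.
findmax_op((a, i), (b, j)) = isless(a, b) ? (b, j) : (isequal(a, b) ? (a, min(i, j)) : (a, i))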
