big mapreduce performance #1012

xaellison · 2021-06-24T22:02:57Z

This is an optimization for multidimensional mapreduce when the non-reduced domain is large (e.g., sum(CUDA.zeros(Int, 40000, 100), dims=1)). When there is an independent reduction problem for every thread on the device, a naive per-thread loop is more efficient than reduce_block, because in reduce_block not every thread is doing work at each iteration.
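To make the "naive loop" concrete, here is a minimal sketch of the idea, not the code in this PR: each thread walks one row serially, so no thread sits idle the way threads do in a tree-style block reduction. The kernel name `rowwise_reduce_kernel!`, the array sizes, and the launch parameters are all assumptions chosen for illustration.

```julia
using CUDA

# Hypothetical kernel (not the PR's implementation): thread i reduces row i
# of A with `op`, starting from `init`, and writes the result to out[i].
function rowwise_reduce_kernel!(op, out, A, init)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= size(A, 1)
        acc = init
        for j in 1:size(A, 2)
            # Neighbouring threads read neighbouring rows of the same column,
            # which is contiguous in Julia's column-major layout.
            @inbounds acc = op(acc, A[i, j])
        end
        @inbounds out[i] = acc
    end
    return nothing
end

A = CuArray(rand(Int32, 1 << 16, 64))        # illustrative size
out = CUDA.zeros(Int32, size(A, 1))
threads = 256                                 # assumed block size
blocks = cld(size(A, 1), threads)
@cuda threads=threads blocks=blocks rowwise_reduce_kernel!(min, out, A, typemax(Int32))

# Compare against the CPU result for the same reduction.
@assert Array(out) == vec(minimum(Array(A), dims=2))
```

The sketch only pays off when there are enough rows to keep every thread busy; with few rows and a long reduced dimension, a cooperative block reduction is likely still the better choice.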

Performance tested like:

using BenchmarkTools, CUDA, Statistics

for X in map(i -> 1 << i, 11:18)
    for Y in map(i -> 1 << i, 2:2:18)
        # Only benchmark arrays with between 32Ki and 128Mi elements.
        if 1024 * 32 <= X * Y <= 1024 * 1024 * 128
            b = @benchmark begin CUDA.@sync minimum(c, dims=2) end setup=begin c = CuArray(rand(Int32, $X, $Y)) end teardown=begin GC.gc(); sleep(0.01) end evals=1 samples=3 seconds=3
            println((X, Y, median(b.times)))
        end
    end
end

I ran this with minimum, findmax, sum, and sum(cos, ...). The performance gain scales with problem size; I've measured up to a 10x speedup (on sum for size (262144, 256)). The test was performed without the current size check, to allow all cases.

boundary conds and un-hardcoding guard clause either no change in perf or mild (10%) benefit depending on f, op test naming consistency pt2 pt3

codecov · 2021-06-24T23:52:30Z

Codecov Report

Merging #1012 (b40c048) into master (828d44e) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #1012      +/-   ##
==========================================
+ Coverage   78.91%   78.93%   +0.01%     
==========================================
  Files         122      122              
  Lines        7960     7969       +9     
==========================================
+ Hits         6282     6290       +8     
- Misses       1678     1679       +1

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/mapreduce.jl | `100.00% <100.00%> (ø)` | |
| lib/cusolver/CUSOLVER.jl | `86.36% <0.00%> (-1.14%)` | ⬇️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 828d44e...b40c048.

maleadt · 2021-06-29T05:39:09Z

The occupancy API also returns the minimal number of blocks that should be active to reach full occupancy; maybe that could be used instead of the current big_mapreduce_threshold heuristic? But then we'd need to compile both kernels, which seems wasteful. Let's go with this for now.
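For reference, a rough sketch of what an occupancy-based check might look like, assuming CUDA.jl's `@cuda launch=false` and `launch_configuration`. The kernel `serial_rows_kernel!` and the `saturates` rule are hypothetical and only illustrate the suggestion; they are not what this PR implements.

```julia
using CUDA

# Hypothetical stand-in for the per-thread reduction kernel; not the PR's code.
function serial_rows_kernel!(out, A)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= size(A, 1)
        acc = zero(eltype(out))
        for j in 1:size(A, 2)
            @inbounds acc += A[i, j]
        end
        @inbounds out[i] = acc
    end
    return nothing
end

A = CUDA.ones(Int32, 1 << 16, 64)             # illustrative size
out = CUDA.zeros(Int32, size(A, 1))

# Compile without launching, then query the occupancy API.
kernel = @cuda launch=false serial_rows_kernel!(out, A)
config = launch_configuration(kernel.fun)     # NamedTuple with `blocks` and `threads`

# Assumed decision rule: prefer the serial kernel only when there is at least
# one independent reduction per thread needed for full occupancy.
saturates = length(out) >= config.blocks * config.threads

threads = min(length(out), config.threads)
blocks = cld(length(out), threads)
kernel(out, A; threads, blocks)

# Every row holds 64 ones, so every row sum is 64.
@assert all(Array(out) .== 64)
```

As noted above, the cost of this approach is that the kernel has to be compiled before the decision can be made, which is why the simpler threshold was kept.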

maleadt · 2021-06-29T05:41:15Z

Hm, I just saw the commit message (b40c048); it would have been better to put something cleaner there... I generally don't squash-merge because the history can be useful.

Add simple mapreduce implementation for big reductions.

big mapreduce

b40c048

boundary conds and un-hardcoding guard clause either no change in perf or mild (10%) benefit depending on f, op test naming consistency pt2 pt3

maleadt merged commit 766b39f into JuliaGPU:master Jun 29, 2021

maleadt added a commit that referenced this pull request Jun 29, 2021

Merge pull request #1012 from xaellison/ae/big_mapreduce

057bd8f

Add simple mapreduce implementation for big reductions.

maleadt mentioned this pull request Sep 30, 2021

sum! does not compile for large arrays #1169

Closed
