
big mapreduce performance #1012

Merged: 1 commit merged into JuliaGPU:master on Jun 29, 2021
Conversation

@xaellison (Contributor) commented Jun 24, 2021

This is an optimization for multidimensional mapreduce when the non-reduced domain is large (e.g., `sum(CUDA.zeros(Int, 40000, 100), dims=1)`). When there is an independent reduction problem for every thread on the device, a naive serial loop is more efficient than `reduce_block`, since with `reduce_block` not every thread is doing work at each iteration.
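The idea can be illustrated on the CPU with plain arrays (a hypothetical sketch, not the actual kernel code): when each output element has its own worker, a plain serial loop over the reduced dimension needs no cross-thread cooperation or synchronization, whereas a tree reduction leaves half the participating threads idle at every step.

```julia
# Hypothetical CPU sketch of the per-thread serial reduction strategy.
# Reducing along dims=2, each "thread" t owns one row (one independent
# reduction problem) and accumulates it with a naive loop.
function serial_mapreduce(f, op, A::AbstractMatrix; init)
    out = fill(init, size(A, 1))
    for t in 1:size(A, 1)          # one independent problem per "thread"
        acc = init
        for j in 1:size(A, 2)      # naive serial loop over the reduced dim
            acc = op(acc, f(A[t, j]))
        end
        out[t] = acc
    end
    return out
end

A = reshape(1:12, 3, 4)            # 3 outputs, 4 elements each
serial_mapreduce(identity, +, A; init = 0)  # == vec(sum(A, dims=2))
```

On the GPU the outer loop becomes the thread grid, so the sketch only pays off when there are enough independent problems to keep the device busy, which is what the size check below gates on.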

Performance tested like:

```julia
for X in map(i -> 1 << i, 11:18)
    for Y in map(i -> 1 << i, 2:2:18)
        if 1024 * 32 <= X * Y <= 1024 * 1024 * 128
            b = @benchmark begin
                CUDA.@sync minimum(c, dims=2)
            end setup=begin
                c = CuArray(rand(Int32, $X, $Y))
            end teardown=begin
                GC.gc(); sleep(0.01)
            end evals=1 samples=3 seconds=3
            println((X, Y, median(b.times)))
        end
    end
end
```

The same was done with minimum, findmax, sum, and sum(cos, ...). The performance gain scales with problem size; I've measured up to a 10x speedup (for sum at size 262144 × 256). The tests were run with the current size check disabled, so that all cases take the new path.

boundary conds and un-hardcoding

guard clause

either no change in perf or mild (10%) benefit depending on f, op

test

naming consistency

pt2

pt3
@codecov (bot) commented Jun 24, 2021

Codecov Report

Merging #1012 (b40c048) into master (828d44e) will increase coverage by 0.01%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1012      +/-   ##
==========================================
+ Coverage   78.91%   78.93%   +0.01%     
==========================================
  Files         122      122              
  Lines        7960     7969       +9     
==========================================
+ Hits         6282     6290       +8     
- Misses       1678     1679       +1     
| Impacted Files | Coverage Δ | |
|---|---|---|
| src/mapreduce.jl | 100.00% <100.00%> (ø) | |
| lib/cusolver/CUSOLVER.jl | 86.36% <0.00%> (-1.14%) | ⬇️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@maleadt (Member) commented Jun 29, 2021

The occupancy API also returns the minimal number of blocks that should be active to reach full occupancy -- maybe that could replace the current big_mapreduce_threshold heuristic? But then we'd need to compile both kernels, which seems wasteful. Let's go with this for now.
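The occupancy-based selection suggested here could be sketched as follows (a hypothetical helper, not the PR's code: `blocks` and `threads` stand in for the launch configuration the occupancy API would report, and the `>=` comparison is an assumed policy):

```julia
# Hypothetical heuristic: prefer the serial per-thread kernel when the
# number of independent reduction problems alone is enough to occupy the
# device at the block/thread counts the occupancy API says are needed
# for full utilization.
function use_serial_kernel(n_independent_problems, blocks, threads)
    return n_independent_problems >= blocks * threads
end

use_serial_kernel(40_000, 20, 1024)  # plenty of independent problems
use_serial_kernel(100, 20, 1024)     # too few; use the cooperative kernel
```

The trade-off noted above remains: querying occupancy for the serial kernel requires compiling it, even on the code path that ends up launching the cooperative one.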

@maleadt maleadt merged commit 766b39f into JuliaGPU:master Jun 29, 2021
@maleadt (Member) commented Jun 29, 2021

Hm, I just saw the commit message, b40c048; it would have been better to put something cleaner there... I generally don't squash-merge because the history can be useful.

maleadt added a commit that referenced this pull request Jun 29, 2021
Add simple mapreduce implementation for big reductions.