Poor performance of mapreduce #46

Closed
FuZhiyu opened this issue Aug 24, 2022 · 8 comments · Fixed by #303
Labels
arrays (Things about the array abstraction.) · performance (Gotta go fast.)

Comments

@FuZhiyu

FuZhiyu commented Aug 24, 2022

Probably a known issue by the devs but just for the record:

using Metal, BenchmarkTools
N = 10_000_000
a = rand(Float32, N)
Ma = MtlArray(a)
@btime sum($a)
# 757.209 μs
@btime sum($Ma)
# 3.173 ms

An in-place operation is even slower:

r = Metal.zeros(Float32, 1)
@btime Metal.@sync sum!($r, $Ma)
# 1.603 s (167108 allocations: 4.20 MiB)

Platform: Mac Studio with Apple M1 Max, v1.8.0.

I just realized I'm on Monterey, not Ventura. I don't know whether that is the cause of the poor performance. Other matrix operations are pretty fast, though.

@mchitre

mchitre commented Nov 7, 2022

Similar results on Ventura as well, so that's not the cause.

@maxwindiff
Contributor

On my computer:

julia> a = fill(Float32(1.0), 10*1024*1024);
julia> da = MtlArray(a);
julia> @btime sum(a)
  844.500 μs (1 allocation: 16 bytes)
1.048576f7
julia> @btime sum(da)
  2.707 ms (857 allocations: 23.66 KiB)
1.048576f7

Now, if we do this:

diff --git a/src/mapreduce.jl b/src/mapreduce.jl
index 1d84d78..900f21d 100644
--- a/src/mapreduce.jl
+++ b/src/mapreduce.jl
@@ -123,7 +123,7 @@ function partial_mapreduce_device(f, op, neutral, maxthreads, Rreduce, Rother, s
             ireduce += localDim_reduce * groupDim_reduce
         end
 
-        val = reduce_group(op, val, neutral, shuffle, maxthreads)
+        val = 1 # reduce_group(op, val, neutral, shuffle, maxthreads)
 
         # write back to memory
         if localIdx_reduce == 1

Even with the group reduction stubbed out, it still takes 2 ms just to loop over the input/output arrays!

julia> @btime sum(da)
  2.015 ms (857 allocations: 23.66 KiB)
1.0f0

My guess is that the slowdown comes from all the indexing calculations (same as #41). But eliminating the Cartesian indexing is even harder here, because the reduction process itself can add additional dimensions...
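To make the indexing cost concrete, here is a minimal CPU-only sketch. It does not reproduce the code in mapreduce.jl; `linear_to_cartesian` and `sum_via_cartesian` are hypothetical helpers written for illustration. It assumes column-major layout and shows that turning a linear index into Cartesian coordinates costs one integer division and one remainder per dimension, for every element visited.

```julia
# CPU-only illustration; the helper names are hypothetical, not Metal.jl API.

# Convert a 1-based linear index into 1-based Cartesian coordinates
# for a column-major array with the given dimensions.
function linear_to_cartesian(i::Int, dims::NTuple{N,Int}) where {N}
    rest = i - 1                          # switch to 0-based arithmetic
    coords = Int[]
    for d in 1:N
        push!(coords, rest % dims[d] + 1) # coordinate along dimension d
        rest ÷= dims[d]                   # one integer division per dimension
    end
    return Tuple(coords)
end

# Sum an array while paying the index conversion on every element,
# similar in spirit to a kernel that indexes through Cartesian indices.
function sum_via_cartesian(a::AbstractArray)
    s = zero(eltype(a))
    dims = size(a)
    for i in 1:length(a)
        s += a[linear_to_cartesian(i, dims)...]
    end
    return s
end
```

On a GPU, integer division is comparatively expensive, so this per-element work can plausibly dominate a memory-bound reduction. That is an assumption consistent with the guess above, not a measured result.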

@maxwindiff
Contributor

I tried writing a reduction kernel which only supports 1d arrays, and it's about 4x as fast as the current implementation. I'll try to see if the generic implementation can be further improved.
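The kernel itself is not shown in this thread, so the following is only a CPU emulation of the general shape a 1-D reduction might take, under my own assumptions: each "threadgroup" loads a contiguous chunk using plain linear indices, reduces it with a tree reduction, and writes one partial result, and the partials are then reduced again. `tree_reduce!` and `reduce_1d` are hypothetical names, and the group size of 256 is an arbitrary choice.

```julia
# CPU emulation of a two-stage 1-D reduction; not the actual Metal.jl kernel.

# Tree-reduce a buffer whose length is a power of two.
function tree_reduce!(op, buf::AbstractVector)
    stride = length(buf) ÷ 2
    while stride >= 1
        for t in 1:stride                 # on a GPU these iterations run in parallel
            buf[t] = op(buf[t], buf[t + stride])
        end
        stride ÷= 2
    end
    return buf[1]
end

# Reduce a vector group by group, then reduce the per-group partial results.
function reduce_1d(op, neutral, a::AbstractVector; groupsize::Int = 256)
    @assert ispow2(groupsize)
    ngroups = cld(length(a), groupsize)
    partials = fill(neutral, ngroups)
    buf = Vector{typeof(neutral)}(undef, groupsize)
    for g in 1:ngroups
        offset = (g - 1) * groupsize
        for t in 1:groupsize
            i = offset + t                # plain linear index, no Cartesian math
            buf[t] = i <= length(a) ? a[i] : neutral  # pad the tail with the neutral element
        end
        partials[g] = tree_reduce!(op, buf)
    end
    return reduce(op, partials; init = neutral)
end
```

For example, `reduce_1d(+, 0f0, ones(Float32, 1000))` sums the vector in four groups. Because the 1-D case never needs to map a linear index back to coordinates, it avoids the per-element index arithmetic that the generic implementation has to perform.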

@maxwindiff
Contributor

Reductions are generally faster now; however, the in-place version is still very slow:

julia> @btime sum($a)
  760.000 μs (0 allocations: 0 bytes)
5.001241f6

julia> @btime sum($Ma)
  708.083 μs (1197 allocations: 27.76 KiB)
5.001241f6

julia> @btime Metal.@sync sum!($r, $Ma)
  376.325 ms (101199 allocations: 2.00 MiB)
1-element MtlVector{Float32}:
 5.001241f6

@maxwindiff
Contributor

In-place is slow because it's hitting the init === nothing code path: https://github.com/JuliaGPU/Metal.jl/blob/main/src/mapreduce.jl#L230-L237

If GPUArrays.neutral_element() returned nothing by default, we may be able to do something like:

-Base.mapreducedim!(f, op, R::AnyGPUArray, A::AbstractArray) = mapreducedim!(f, op, R, A)
+Base.mapreducedim!(f, op, R::AnyGPUArray{T}, A::AbstractArray) where {T} =
+  mapreducedim!(f, op, R, A; init=neutral_element(op, T))

With my limited knowledge of Julia fundamentals, I don't know how to extend neutral_element without breaking compatibility. Let me try other ways of initializing the partial reduction array...
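As a rough sketch of the idea, here is a CPU-only version of a lookup that returns `nothing` for unknown operators, so that the caller can choose between a path seeded with a neutral element and a fallback. Everything here is hypothetical: `my_neutral_element` and `reduce_into!` are illustrative names, not the GPUArrays.jl or Metal.jl API, and the fallback shown (seeding from the first element) is just one possible choice.

```julia
# Hypothetical stand-in for a neutral-element lookup; not the GPUArrays.jl API.
my_neutral_element(op, ::Type) = nothing                       # unknown operator
my_neutral_element(::typeof(+), ::Type{T}) where {T} = zero(T)
my_neutral_element(::typeof(*), ::Type{T}) where {T} = one(T)
my_neutral_element(::typeof(max), ::Type{T}) where {T} = typemin(T)
my_neutral_element(::typeof(min), ::Type{T}) where {T} = typemax(T)

# Reduce A into the single-element output R, choosing the path based on
# whether a neutral element is known for `op`.
function reduce_into!(op, R::AbstractVector{T}, A::AbstractVector) where {T}
    init = my_neutral_element(op, T)
    if init === nothing
        # Fallback path: no neutral element, so seed from the first element.
        isempty(A) && throw(ArgumentError("cannot reduce an empty collection without init"))
        acc = convert(T, first(A))
        for x in Iterators.drop(A, 1)
            acc = op(acc, x)
        end
    else
        # Path with a known neutral element: start from init.
        acc = init
        for x in A
            acc = op(acc, x)
        end
    end
    R[1] = acc
    return R
end
```

With this shape, `reduce_into!(+, zeros(Float32, 1), Float32[1, 2, 3])` takes the path with a known neutral element, while an anonymous operator such as `(a, b) -> a + b` falls back to the other branch and produces the same result.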

maleadt changed the title from "low performance of mapreduce over MtlArray" to "Poor performance of mapreduce" on May 22, 2023
maleadt added the arrays (Things about the array abstraction.) label on May 22, 2023
@maleadt
Member

maleadt commented Nov 24, 2023

@rveltz
Copy link

rveltz commented Nov 24, 2023

It would be good to write a similar blog post using Metal.jl.

5 participants