
mapreduce with multiple arrays allocates #53417

Open
roflmaostc opened this issue Feb 21, 2024 · 7 comments · May be fixed by #55301
Labels: arrays, performance (Must go faster)

Comments

@roflmaostc (Contributor) commented Feb 21, 2024

Hi,

With Julia 1.10.1 I noticed that mapreduce with two arrays allocates:

julia> x = randn((512,512));

julia> y = randn((512, 512));

julia> g(x) = x^2
g (generic function with 1 method)

julia> @time mapreduce(g,+,x)
  0.031165 seconds (16.85 k allocations: 1.149 MiB, 99.57% compilation time: 100% of which was recompilation)
261797.64934712922

# no significant allocation, perfect
julia> @time mapreduce(g,+,x)
  0.000168 seconds (1 allocation: 16 bytes)
261797.64934712922

julia> f(x,y) = x * y
f (generic function with 1 method)

julia> mapreduce(f, +, x, y);

# bad allocations
julia> @time mapreduce(f, +, x, y);
  0.001142 seconds (6 allocations: 2.000 MiB)

Looking at the implementation, there is no specialization for the multiple-array case.
Maybe we should adapt the over-promising docstring then?

mapreduce is functionally equivalent to calling reduce(op, map(f, itr); init=init), but will in general execute faster since no intermediate collection needs to be created. See documentation for reduce and map.
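
For reference, the 2.000 MiB reported above matches a fully materialized 512×512 Float64 array, which is what the multi-array fallback appears to build before reducing (a minimal sketch of the equivalence, not the exact Base code path):

# Illustration only: the multi-array method behaves roughly like
#     reduce(+, map(f, x, y))
# where map(f, x, y) materializes the intermediate array first.
julia> 512 * 512 * 8 / 2^20   # size of a 512×512 Float64 array, in MiB
2.0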

Related to #38558 ?

Best,

Felix

@Moelf (Contributor) commented Feb 21, 2024

It looks like in some cases the "bad allocation" case is actually faster:

julia> function manual(x,y)
           s = zero(eltype(x))
           for (i,j) in zip(x,y)
               s += i*j
           end
           return s
       end
manual (generic function with 1 method)

julia> @benchmark manual(x, y) setup=(x=rand(100,100); y=rand(100,100))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  9.783 μs …  31.469 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     9.813 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.928 μs ± 677.873 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ██▄▄▃▃▂▁▁▁           ▁▁               ▁▂▁                   ▁
  ███████████▇▇▇▇▇▇▇█▇█████▇▇▅▅▅▅▃▃▄▄▃▄▅███▇▆▆▅▅▅▅▃▄▄▃▃▄▄▄▄▄▂ █
  9.78 μs      Histogram: log(frequency) by time        11 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark mapreduce(*, +, x, y) setup=(x=rand(100,100); y=rand(100,100))
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.445 μs … 56.432 μs  ┊ GC (min … max): 0.00% … 69.95%
 Time  (median):     6.132 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.592 μs ±  3.665 μs  ┊ GC (mean ± σ):  5.15% ±  7.96%

  ▅█▂▁                                                       ▁
  █████▇█▇▇▆▅▅▆▅▅▃▃▃▁▃▁▁▁▁▃▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▅ █
  5.44 μs      Histogram: log(frequency) by time       38 μs <

 Memory estimate: 78.17 KiB, allocs estimate: 2.

@roflmaostc (Contributor, Author)

Yeah, but my GPU runs out of memory (the arrays are gigabytes large), so I would rather avoid the allocations.

@roflmaostc (Contributor, Author)

Not sure if your benchmark is fair. It looks like your manual loop is not using SIMD...

julia> function manual(x,y)
           s = zero(eltype(x))
           @simd for i in 1:length(x)
               @inbounds s += x[i] * y[i]
           end
           return s
       end
manual (generic function with 1 method)

julia> @benchmark manual(x,y) setup=(x=rand(100, 100); y=rand(100, 100))
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.406 μs …   4.790 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.559 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.616 μs ± 234.085 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   █▃  ▃▁ ▂▃▁                                                  
  ▃██▅▇███████▇▆▅▅▄▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  1.41 μs         Histogram: frequency by time        2.38 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark mapreduce(*, +, x,y) setup=(x=rand(100, 100); y=rand(100, 100))
BenchmarkTools.Trial: 10000 samples with 3 evaluations.
 Range (min … max):   7.389 μs … 282.207 μs  ┊ GC (min … max): 0.00% … 87.14%
 Time  (median):      9.740 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   11.754 μs ±  14.580 μs  ┊ GC (mean ± σ):  7.31% ±  5.95%

   ▇▂█                                                          
  ▇████▆▆▆▅▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  7.39 μs         Histogram: frequency by time         47.2 μs <

 Memory estimate: 78.17 KiB, allocs estimate: 2.

@roflmaostc (Contributor, Author) commented Feb 21, 2024

I found a way to express this with foldl that achieves roughly the same result:

julia> foldf = (acc, t) -> acc[1] + t[1] * t[2]
#5 (generic function with 1 method)

julia> @time foldl(foldf, zip(x, y))
  0.000031 seconds (5 allocations: 144 bytes)
2498.8165946524705

julia> f(x,y) = x * y
f (generic function with 1 method)

julia> @time mapreduce(f, +, x, y)
  0.000059 seconds (6 allocations: 78.281 KiB)
2498.794164503838
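
Another possible workaround (a sketch, relying on sum/mapreduce being specialized for lazy Broadcasted objects) is to reduce over a lazy broadcast, which should also avoid materializing the product array:

julia> bc = Base.Broadcast.instantiate(Base.Broadcast.broadcasted(f, x, y));  # lazy x .* y, nothing materialized

julia> sum(bc)  # reduces the lazy broadcast directly; no intermediate array expected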

@raminammour (Contributor)

Not sure if it covers all the use cases, but maybe change this

julia/base/reducedim.jl

Lines 329 to 330 in 4e72944

mapreduce(f, op, A::AbstractArrayOrBroadcasted...; kw...) =
    reduce(op, map(f, A...); kw...)

to

mapreduce(f, op, A::AbstractArrayOrBroadcasted...; kw...) =
    reduce(op, Iterators.map(f, A...); kw...)

The lazy map won't allocate, as promised in the documentation. The manual version still seems faster though.
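
The same idea can also be tried at the call site without patching Base (a sketch relying on Iterators.map being lazy):

julia> f(x, y) = x * y;

julia> reduce(+, Iterators.map(f, x, y))  # lazy generator over zip(x, y); no intermediate array is built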

raminammour added a commit to raminammour/julia that referenced this issue Feb 21, 2024
@mcabbott (Contributor)

Possible dup of #38558 as noted.

See #39053 and #41001 for earlier attempts to fix some cases.

@roflmaostc (Contributor, Author) commented Feb 21, 2024

Just to confirm: on CUDA.jl this is indeed allocation-free:

julia> x = CuArray(rand(512, 512));

julia> y = copy(x);

julia> f(x,y) = x*y
f (generic function with 1 method)

julia> CUDA.@time mapreduce(f, +, x,y);
  7.071664 seconds (14.90 M CPU allocations: 1010.226 MiB, 3.82% gc time) (2 GPU allocations: 232 bytes, 0.00% memmgmt time)

julia> CUDA.@time mapreduce(f, +, x,y);
  0.000195 seconds (62 CPU allocations: 2.578 KiB) (2 GPU allocations: 232 bytes, 9.79% memmgmt time)

julia> xc = Array(x);

julia> yc = Array(y);

julia> @time mapreduce(f, +, xc,yc);
  0.122801 seconds (328.86 k allocations: 24.198 MiB, 5.45% gc time, 98.93% compilation time)

julia> @time mapreduce(f, +, xc,yc);
  0.001084 seconds (6 allocations: 2.000 MiB)

mcabbott added the performance (Must go faster) and arrays labels on Feb 21, 2024