
mapreduce with multiple arrays allocates #53417

Open
roflmaostc opened this issue Feb 21, 2024 · 7 comments · May be fixed by #55301
Labels: arrays, performance (Must go faster)

Comments

@roflmaostc (Contributor) commented Feb 21, 2024

Hi,

With Julia 1.10.1 I noticed that mapreduce with two arrays allocates:

julia> x = randn((512,512));

julia> y = randn((512, 512));

julia> g(x) = x^2
g (generic function with 1 method)

julia> @time mapreduce(g,+,x)
  0.031165 seconds (16.85 k allocations: 1.149 MiB, 99.57% compilation time: 100% of which was recompilation)
261797.64934712922

# no significant allocation, perfect
julia> @time mapreduce(g,+,x)
  0.000168 seconds (1 allocation: 16 bytes)
261797.64934712922

julia> f(x,y) = x * y
f (generic function with 1 method)

julia> mapreduce(f, +, x, y);

# bad allocations
julia> @time mapreduce(f, +, x, y);
  0.001142 seconds (6 allocations: 2.000 MiB)

Looking at the implementation, there is no specialization for the multiple-array case.
Maybe we should adapt the over-promising docstring then?

mapreduce is functionally equivalent to calling reduce(op, map(f, itr); init=init), but will in general execute faster since no intermediate collection needs to be created. See documentation for reduce and map.
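
For reference, the 2.000 MiB reported above matches a fully materialized 512×512 Float64 array, which is what the multi-array fallback appears to build before reducing (a minimal sketch of the equivalence, not the exact Base code path):

# Illustration only: the multi-array method behaves roughly like
#     reduce(+, map(f, x, y))
# where map(f, x, y) materializes the intermediate array first.
julia> 512 * 512 * 8 / 2^20   # size of a 512×512 Float64 array, in MiB
2.0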

Related to #38558 ?

Best,

Felix

@Moelf (Contributor) commented Feb 21, 2024

It looks like in some cases the "bad allocation" case is actually faster:

julia> function manual(x,y)
           s = zero(eltype(x))
           for (i,j) in zip(x,y)
               s += i*j
           end
           return s
       end
manual (generic function with 1 method)

julia> @benchmark manual(x, y) setup=(x=rand(100,100); y=rand(100,100))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  9.783 μs …  31.469 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     9.813 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.928 μs ± 677.873 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ██▄▄▃▃▂▁▁▁           ▁▁               ▁▂▁                   ▁
  ███████████▇▇▇▇▇▇▇█▇█████▇▇▅▅▅▅▃▃▄▄▃▄▅███▇▆▆▅▅▅▅▃▄▄▃▃▄▄▄▄▄▂ █
  9.78 μs      Histogram: log(frequency) by time        11 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark mapreduce(*, +, x, y) setup=(x=rand(100,100); y=rand(100,100))
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.445 μs … 56.432 μs  ┊ GC (min … max): 0.00% … 69.95%
 Time  (median):     6.132 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.592 μs ±  3.665 μs  ┊ GC (mean ± σ):  5.15% ±  7.96%

  ▅█▂▁                                                       ▁
  █████▇█▇▇▆▅▅▆▅▅▃▃▃▁▃▁▁▁▁▃▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▅ █
  5.44 μs      Histogram: log(frequency) by time       38 μs <

 Memory estimate: 78.17 KiB, allocs estimate: 2.

@roflmaostc (Contributor, Author)

Yeah, but my GPU runs out of memory (the arrays are gigabytes large), so I would rather avoid the allocations.

@roflmaostc (Contributor, Author)

Not sure if your benchmark is fair. It looks like your manual loop is not using SIMD...

julia> function manual(x,y)
           s = zero(eltype(x))
           @simd for i in 1:length(x)
               @inbounds s += x[i] * y[i]
           end
           return s
       end
manual (generic function with 1 method)

julia> @benchmark manual(x,y) setup=(x=rand(100, 100); y=rand(100, 100))
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.406 μs …   4.790 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.559 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.616 μs ± 234.085 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   █▃  ▃▁ ▂▃▁                                                  
  ▃██▅▇███████▇▆▅▅▄▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  1.41 μs         Histogram: frequency by time        2.38 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark mapreduce(*, +, x,y) setup=(x=rand(100, 100); y=rand(100, 100))
BenchmarkTools.Trial: 10000 samples with 3 evaluations.
 Range (min … max):   7.389 μs … 282.207 μs  ┊ GC (min … max): 0.00% … 87.14%
 Time  (median):      9.740 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   11.754 μs ±  14.580 μs  ┊ GC (mean ± σ):  7.31% ±  5.95%

   ▇▂█                                                          
  ▇████▆▆▆▅▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  7.39 μs         Histogram: frequency by time         47.2 μs <

 Memory estimate: 78.17 KiB, allocs estimate: 2.

@roflmaostc (Contributor, Author) commented Feb 21, 2024

I found a way to express this with foldl that achieves roughly the same result:

julia> foldf = (acc, t) -> acc[1] + t[1] * t[2]
#5 (generic function with 1 method)

julia> @time foldl(foldf, zip(x, y))
  0.000031 seconds (5 allocations: 144 bytes)
2498.8165946524705

julia> f(x,y) = x * y
f (generic function with 1 method)

julia> @time mapreduce(f, +, x, y)
  0.000059 seconds (6 allocations: 78.281 KiB)
2498.794164503838
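
Another possible workaround (a sketch, relying on sum/mapreduce being specialized for lazy Broadcasted objects) is to reduce over a lazy broadcast, which should also avoid materializing the product array:

julia> bc = Base.Broadcast.instantiate(Base.Broadcast.broadcasted(f, x, y));  # lazy x .* y, nothing materialized

julia> sum(bc)  # reduces the lazy broadcast directly; no intermediate array expected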

@raminammour (Contributor)

Not sure if it covers all the use cases, but maybe change this

julia/base/reducedim.jl

Lines 329 to 330 in 4e72944

mapreduce(f, op, A::AbstractArrayOrBroadcasted...; kw...) =
    reduce(op, map(f, A...); kw...)

to

mapreduce(f, op, A::AbstractArrayOrBroadcasted...; kw...) =
    reduce(op, Iterators.map(f, A...); kw...)

The lazy map won't allocate, as promised in the documentation. The manual version still seems faster though.
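
The same idea can also be tried at the call site without patching Base (a sketch relying on Iterators.map being lazy):

julia> f(x, y) = x * y;

julia> reduce(+, Iterators.map(f, x, y))  # lazy generator over zip(x, y); no intermediate array is built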

raminammour added a commit to raminammour/julia that referenced this issue Feb 21, 2024
@mcabbott (Contributor)

Possible dup of #38558 as noted.

See #39053 and #41001 for earlier attempts to fix some cases.

@roflmaostc (Contributor, Author) commented Feb 21, 2024

Just to confirm: on CUDA.jl this is indeed allocation-free:

julia> x = CuArray(rand(512, 512));

julia> y = copy(x);

julia> f(x,y) = x*y
f (generic function with 1 method)

julia> CUDA.@time mapreduce(f, +, x,y);
  7.071664 seconds (14.90 M CPU allocations: 1010.226 MiB, 3.82% gc time) (2 GPU allocations: 232 bytes, 0.00% memmgmt time)

julia> CUDA.@time mapreduce(f, +, x,y);
  0.000195 seconds (62 CPU allocations: 2.578 KiB) (2 GPU allocations: 232 bytes, 9.79% memmgmt time)

julia> xc = Array(x);

julia> yc = Array(y);

julia> @time mapreduce(f, +, xc,yc);
  0.122801 seconds (328.86 k allocations: 24.198 MiB, 5.45% gc time, 98.93% compilation time)

julia> @time mapreduce(f, +, xc,yc);
  0.001084 seconds (6 allocations: 2.000 MiB)

mcabbott added the performance (Must go faster) and arrays labels on Feb 21, 2024