
Tuple-dialect has performance consequences (anonymous functions?) #17126

Closed
timholy opened this issue Jun 26, 2016 · 10 comments

Labels
performance Must go faster

Comments

@timholy (Sponsor Member) commented Jun 26, 2016

This may be a duplicate of #15276, but that issue seems to span various gradations of challenge, so perhaps additional examples are useful.

Below is test code implementing the equivalent of map(s->1:s, size(A)) using 3 different strategies: one based on ntuple, several variants of map, and two that manually assemble the tuple. On my laptop, the performance of these methods varies by more than 5x; presumably we'd prefer them all to be the same (and all good!). Naturally, the best of these is also the most laborious (the manual method), and it beats all the others by a factor of 2.

This is admittedly nitty-gritty microoptimization, but it was necessary to get #16260 to pass nanosoldier; as I work on the cleanup (#16973), I thought I'd better take the time to document some of the challenges.

# Comparing 3 different implementations of indices
# indices1: calling "ntuple" (a local implementation you can play with)
@inline indices1{T,N}(A::AbstractArray{T,N}) = myntuple(d->1:size(A,d), Val{N})
# indices2: using "map" (a local implementation you can play with)
# Note this is largely academic, since you'd rather define indices2(A)
# in terms of indices2(A, d). Informative nonetheless.
@inline indices2a(A) = mymap_a(s->1:s, size(A))
@inline indices2b(A) = mymap_b(s->1:s, size(A))
@inline indices2c(A) = mymap_c(s->1:s, size(A))
# indices3: directly unrolling the function
@inline indices3a(A) = _indices3a((), A)
@inline _indices3a{T,N}(out::NTuple{N}, A::AbstractArray{T,N}) = out
@inline _indices3a(out, A) = _indices3a((out..., 1:size(A, length(out)+1)), A)
@inline indices3b(A) = _indices3b(size(A))
@inline _indices3b(sz::Tuple) = _indices3b(sz...)
@inline _indices3b() = ()
@inline _indices3b(s::Integer, sz...) = (1:s, _indices3b(sz...)...)

# Like ntuple, without any checking
@inline myntuple{F,N}(f::F, v::Type{Val{N}}) = _myntuple((), f, v)
@inline _myntuple{F,N}(out::NTuple{N}, f::F, ::Type{Val{N}}) = out
@inline _myntuple{F,N}(out, f::F, v::Type{Val{N}}) = _myntuple((out..., f(length(out)+1)), f, v)

# Like map
@inline mymap_a(f, t) = (f(t[1]), mymap_a(f, Base.tail(t))...)
@inline mymap_a(f, ::Tuple{}) = ()

# Alternative map
@inline mymap_b(f, t) = (_mymap_b(f, t...)...,)
@inline _mymap_b(f) = ()
@inline _mymap_b(f, h, t...) = (f(h), _mymap_b(f, t...)...)

# Another alternative map
@inline mymap_c(f, t) = _mymap_c((), f, t...)
@inline _mymap_c(out, f) = out
@inline _mymap_c(out, f, h, t...) = _mymap_c((out..., f(h)), f, t...)

A = rand(2,3,4,5)
using BenchmarkTools
println("ntuple implementation")
@show @benchmark indices1($A)
println("map implementations")
@show @benchmark indices2a($A)
@show @benchmark indices2b($A)
@show @benchmark indices2c($A)
println("manually-inlined implementation")
@show @benchmark indices3a($A)
@show @benchmark indices3b($A)

Results:

julia> include("/tmp/dialect.jl")
ntuple implementation
@benchmark(indices1($(Expr(:$, :A)))) = BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     999
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  16.00 bytes
  allocs estimate:  1
  minimum time:     10.00 ns (0.00% GC)
  median time:      11.00 ns (0.00% GC)
  mean time:        11.82 ns (5.84% GC)
  maximum time:     1.32 μs (97.95% GC)
map implementations
@benchmark(indices2a($(Expr(:$, :A)))) = BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     1000
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  0.00 bytes
  allocs estimate:  0
  minimum time:     5.00 ns (0.00% GC)
  median time:      5.00 ns (0.00% GC)
  mean time:        5.01 ns (0.00% GC)
  maximum time:     17.00 ns (0.00% GC)
@benchmark(indices2b($(Expr(:$, :A)))) = BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     1000
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  0.00 bytes
  allocs estimate:  0
  minimum time:     4.00 ns (0.00% GC)
  median time:      4.00 ns (0.00% GC)
  mean time:        4.03 ns (0.00% GC)
  maximum time:     17.00 ns (0.00% GC)
@benchmark(indices2c($(Expr(:$, :A)))) = BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     1000
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  0.00 bytes
  allocs estimate:  0
  minimum time:     4.00 ns (0.00% GC)
  median time:      4.00 ns (0.00% GC)
  mean time:        4.05 ns (0.00% GC)
  maximum time:     24.00 ns (0.00% GC)
manually-inlined implementation
@benchmark(indices3a($(Expr(:$, :A)))) = BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     1000
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  0.00 bytes
  allocs estimate:  0
  minimum time:     2.00 ns (0.00% GC)
  median time:      2.00 ns (0.00% GC)
  mean time:        2.01 ns (0.00% GC)
  maximum time:     13.00 ns (0.00% GC)
@benchmark(indices3b($(Expr(:$, :A)))) = BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     1000
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  0.00 bytes
  allocs estimate:  0
  minimum time:     5.00 ns (0.00% GC)
  median time:      5.00 ns (0.00% GC)
  mean time:        5.02 ns (0.00% GC)
  maximum time:     19.00 ns (0.00% GC)
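
Summary of the minimum times above: indices1 (ntuple) 10 ns; indices2a 5 ns; indices2b 4 ns; indices2c 4 ns; indices3a 2 ns; indices3b 5 ns.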
@Jutho (Contributor) commented Jun 26, 2016

mymap_a should probably call mymap_a instead of plain map. Does this affect the timing?
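
For reference, the difference in question is presumably this (reconstructed from the follow-up below; only the recursive call changes):

# Pre-edit: the recursion accidentally dispatches to Base's map
@inline mymap_a(f, t) = (f(t[1]), map(f, Base.tail(t))...)
# Post-edit, as now shown above: the recursion calls itself
@inline mymap_a(f, t) = (f(t[1]), mymap_a(f, Base.tail(t))...)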

@timholy (Sponsor Member, Author) commented Jun 27, 2016

Great catch as always, @Jutho. It doesn't seem to affect the timing, though; the map was there because the code was copy/pasted from Base's map, so the two were essentially identical. But it certainly would have been confusing for anyone making modifications and trying to discover the cause, so thanks. EDIT: I edited the code above to fix this error.

@kshyatt added the performance (Must go faster) label Jun 27, 2016
@timholy (Sponsor Member, Author) commented Jun 29, 2016

Worth noting that making the top-level functions @pure gives them all the performance of indices3. I'm a bit unclear about whether this would be considered an abuse of @pure (I suspect the answer is "yes", since I think it's intended to be reserved for type computations).
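
For concreteness, a minimal sketch of the annotation in question (indices2a_pure is a hypothetical name; whether this is a legitimate use of @pure is exactly the open question):

# Marking the top-level function pure lets the compiler evaluate it at
# compile time whenever the inputs allow; with this, the timing matches indices3.
Base.@pure indices2a_pure(A) = mymap_a(s->1:s, size(A))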

@andyferris (Member) commented:

(I suspect the answer is "yes", since I think it's intended to be reserved for type computations).

Hmm... that sounds a bit strict.

The threads I read defined "pure" as any function the programmer is happy to let the compiler run at any time (once at compile time, more than once at run time, etc.) without worrying about the consequences for program logic. (The particular implementation of the compiler then affects speed, inference, and so on, but not the program's output.)
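
A hypothetical illustration of that definition (made-up names, a sketch only):

# "Pure" under this definition: the result depends only on the argument,
# so the compiler may run it once, many times, or at compile time.
tuplen{T,N}(::Type{NTuple{N,T}}) = N

# Not pure: each call mutates observable global state, so the compiler
# must not fold, hoist, or duplicate it.
counter = 0
count_len(t) = (global counter += 1; length(t))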

@andyferris (Member) commented:

Follow-up: obviously, making it a function of types is the clearest way to make it runnable at compile time, but inference now tracks many constant values too...

@timholy (Sponsor Member, Author) commented Jul 1, 2016

I guess part of my thinking was that it would be nice to mark ntuple and map as pure, but an uncontrolled function argument makes that impossible. That concern doesn't apply to indices, since the function is known to be free of side effects.

It would be great to be able to say "do the expansion at compile time, but save the evaluation of the user function for run time" (naturally, @generated functions allow this).
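
To illustrate, a minimal @generated sketch of that split (genmap is a hypothetical name, not a proposal): the tuple expression is assembled at compile time, where only the length N is known, while the calls to f stay in the generated code and run at run time.

@generated function genmap{N}(f, t::NTuple{N})
    ex = Expr(:tuple)                # built once, at compile time
    for d = 1:N
        push!(ex.args, :(f(t[$d])))  # f itself is evaluated at run time
    end
    ex
end
# e.g. genmap(s->1:s, size(A)) expands to (f(t[1]), ..., f(t[N]))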

@blakejohnson (Contributor) commented:

I think you are safe to mark indices as @pure.

@timholy (Sponsor Member, Author) commented Aug 1, 2016

Based on discussion in #17729, @pure isn't safe here. But in 6001540 I discovered that the variation stemmed from our implementation of size (lack of sufficient @inline), NOT from the particular way the tuple manipulations were written. That's reassuring, though wow, what a huge waste of time over a missing @inline statement or two...

ntuple still performs very badly compared to the rest, but that can be a separate issue.
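
The flavor of the fix, as a hedged sketch (made-up names, not the actual commit):

# Without an explicit annotation, the inliner may leave this call opaque,
# which blocks the tuple unrolling downstream...
getdim_noinline(A::AbstractArray, d::Integer) = size(A)[d]
# ...whereas forcing inlining lets everything fold away:
@inline getdim_inline(A::AbstractArray, d::Integer) = size(A)[d]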

@timholy closed this as completed Aug 1, 2016
@KristofferC (Sponsor Member) commented Aug 1, 2016

It feels like we (well, you, @timholy) keep finding cases where adding explicit @inlines significantly improves performance. Instead of playing whack-a-mole, doesn't this just mean that our inlining heuristics could be improved? Is there a common case (lispy tuple-style coding) that we could teach the inliner to reason better about?

@timholy (Sponsor Member, Author) commented Aug 1, 2016

I would love to see improvements in our inlining heuristics. I don't understand them at all, so that's not an area I've looked at or plan to look at.

I try to be fairly conservative, not adding @inline unless I'm sure it makes a difference, but in some cases this comes back to bite me.
