Performance of splatting a number #29114
Comments
We could special case numbers as we do tuples in the inliner. The more generic solution would have to emulate the iteration protocol.
Forgive me if this is a naive understanding of inference, but it seems like there's a middle ground here: isn't it fairly easy to see that …
Yes, but that's still emulating the iteration protocol in the optimizer (inference already does some of this), and the `iterate` isn't part of the code (since `_apply` does it internally). Basically you need to implement this TODO in `base/compiler/ssair/inlining.jl` (line 831 at 3c8119f), or special-case it there.
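For reference, the iteration protocol that would have to be emulated is trivial for numbers: `Base` defines `iterate` on numbers so that they behave as one-element collections. A quick sketch:

```julia
iterate(1)           # (1, nothing): the single element plus the iteration state
iterate(1, nothing)  # nothing: iteration is exhausted after one step
```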
I just noticed we share basically the same problem over at JuliaArrays/StaticArrays.jl#361. What if we change the lowering of splatting so that it goes through definitions like

```julia
Base.splatarg(x::Number) = (x,)
Base.splatarg(v::StaticVector) = Tuple(v)
```

This does mean that splatting would become configurable by user types, but that seems generally consistent with the way we usually lower syntax into function calls.
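A rough sketch of how that lowering could behave (`splatarg` is only the name proposed above, not part of Base, and the lowering shown in the comments is an assumption):

```julia
# Hypothetical: the default method passes the iterable through untouched,
# while a Number is turned into a length-1 tuple that the existing
# tuple inliner already handles well.
splatarg(x) = x
splatarg(x::Number) = (x,)
# a package could add, e.g., splatarg(v::StaticVector) = Tuple(v)

# `g(a, xs...)` would then lower to roughly `g(a, splatarg(xs)...)`:
g(args...) = +(args...)
g(1, splatarg(2)...)        # 3 -- the splatted number is now a tuple
g(1, splatarg((2, 3))...)   # 6 -- tuples pass through unchanged
```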
Nice idea. Maybe it can be …
Interesting analogy; we could go with it, though I think the name is less evocative than @Keno's point about making inference aware of iteration in …
Haha... no, I think I believe in strong interfaces, and in guarantees the compiler doesn't break for the purpose of performance optimizations. Using … So, to my personal taste, we would only resort to that if Keno's approach is too messy. But my opinion isn't strong.
Yes, something which can be overridden only really makes sense if there's a compelling reason to allow splatting to be different from iteration in some cases. Otherwise it's just a nasty performance hack. My question was really whether such a use case exists. In hindsight that's actually rather orthogonal to solving this issue.

Another thought: I don't know how hard it is to teach inference about iteration in `_apply`, but we could lower `f(x, y..., z...)` into something like

```julia
args = flatten_args((x,), y, z)
_apply(f, args)
```

where …
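A minimal sketch of what such a `flatten_args` could look like as an ordinary generic function (only the name comes from the comment above; the definition is an assumption, and a real lowering could of course not itself rely on splatting syntax):

```julia
# Accumulate already-flattened arguments in a tuple and append each
# remaining splatted collection in turn.
flatten_args(acc::Tuple) = acc
flatten_args(acc::Tuple, x, rest...) = flatten_args((acc..., x...), rest...)

flatten_args((1,), (2, 3), 4:5)   # (1, 2, 3, 4, 5)
```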
OK - that seems pretty interesting!
It's currently tricky to surface the C implementation of `_apply` argument flattening in a useful way. Ideally we'd want something roughly like …

One approach to implementing "when possible" would be to generalize the … Doing more in inference to constant-propagate the iteration state for small finite iterations seems helpful, and we might get rid of …

```julia
julia> function iter_unpack(t)
           i1 = iterate(t)
           a = i1[1]
           i2 = iterate(t, i1[2])
           b = i2[1]
           b
       end
iter_unpack (generic function with 1 method)

julia> @code_warntype optimize=true iter_unpack((1,2))
Variables
  #self#::Core.Compiler.Const(iter_unpack, false)
  t::Tuple{Int64,Int64}
  i1::Union{Nothing, Tuple{Int64,Int64}}
  a::Int64
  i2::Union{Nothing, Tuple{Int64,Int64}}
  b::Int64
# big pile of code ...
# i1 and i2 are partially const but that seems not to be inferred
```
If we are going to add an API to manipulate splatting behavior, it would be great if it could also be dispatched on from the function side. For example, this lets you dispatch …
@tkf that's a very interesting idea, to get actual syntax for reductions:

```julia
*(xs...)            # ≡ prod(xs)
+(xs...)            # ≡ sum(xs)
min(xs...)          # ≡ minimum(xs)
max(xs...)          # ≡ maximum(xs)
+(abs.(xs)...)      # ≡ mapreduce(abs, +, xs)
+(xs.^2...)         # ≡ mapreduce(x->x^2, +, xs)
```

That last one is particularly appealing syntax. One problem is that it would mean defining a lot of …
I think there are multiple solutions. For example:

**Abstract-type based solution**

```julia
abstract type AssociativeOperator <: Function end

(op::AssociativeOperator)(xs::GenericVararg{<:Union{AbstractArray, Broadcasted}}) =
    reduce(op, xs)

struct Plus <: AssociativeOperator end
struct Multiply <: AssociativeOperator end
struct Vcat <: AssociativeOperator end

const + = Plus()
const * = Multiply()
const vcat = Vcat()

+(x, y) = ...
*(x, y) = ...
vcat(x, y) = ...
```

**Different lowering**

Alternatively, you can lower splatted calls into something like `splatapply(f, xs)` and define

```julia
const AssociativeOperator = Union{
    typeof(*),
    typeof(+),
    typeof(vcat),
    ...and so on...
}

splatapply(f::AssociativeOperator, xs::GenericVararg{<:Union{AbstractArray, Broadcasted}}) =
    reduce(f, xs)
```
Let me repeat the main points from #27434 (comment): Inference already simulates the iteration protocol (to some degree); however, that is solely used to determine the return type of the `_apply` call:

```julia
julia> foo(x) = (x...,)
foo (generic function with 1 method)

julia> @code_typed foo(1)
CodeInfo(
1 ─ %1 = Core._apply(Core.tuple, x)::Tuple{Int64}
└── return %1
) => Tuple{Int64}
```

As you can see, inference correctly determines that splatting an `Int` gives a `Tuple{Int64}`. That said, inference does not currently propagate constants during the iteration emulation, so that e.g.

```julia
julia> @code_typed foo(@SVector [1,2,3])
CodeInfo(
1 ─ %1 = Core._apply(Core.tuple, x)::Tuple{Int64,Vararg{Int64,N} where N}
└── return %1
) => Tuple{Int64,Vararg{Int64,N} where N}
```

That might be fixable; however, recognizing that it is actually one of the elementary types (here a …
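For comparison, the usual workaround today is to convert to a tuple before splatting, which hands the optimizer one of the types it already special-cases; a sketch assuming StaticArrays is loaded:

```julia
using StaticArrays

foo(x) = (x...,)          # splats via the generic _apply path
bar(x) = (Tuple(x)...,)   # explicit conversion first, then a tuple splat

# @code_typed bar(@SVector [1, 2, 3]) should show a concretely inferred
# NTuple{3,Int64} result, unlike the Vararg-typed result for foo above.
```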
Regarding the more general issue, I did some experimenting I also want to share. I started by assuming … I got this working (mostly) with modest effort, and my conclusions are: excellent for cases with an inferable (= statically known size) result of …; I have a feeling this is a dead end due to the unavoidable penalties in the size-not-statically-known case, but maybe someone has a good idea...
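For readers following along, "partial unrolling" here means expanding the first few `iterate` calls by hand so inference can fold them away for short iterables; a very rough sketch (the names and the cut-off are assumptions, not the actual experiment):

```julia
# Unroll up to two elements explicitly; for anything longer (or of
# statically unknown length) fall back to collecting the remainder,
# which is where the unavoidable penalty mentioned above comes from.
function unrolled_splat_call(f, it)
    s1 = iterate(it)
    s1 === nothing && return f()
    x1, st1 = s1
    s2 = iterate(it, st1)
    s2 === nothing && return f(x1)
    x2, st2 = s2
    iterate(it, st2) === nothing && return f(x1, x2)
    return f(x1, x2, collect(Iterators.rest(it, st2))...)   # slow fallback
end

unrolled_splat_call(+, 1)          # 1   (splatting a number)
unrolled_splat_call(tuple, 1:4)    # (1, 2, 3, 4)
```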
Right, the difficulty is all about the flattening "when possible" referred to in #29114 (comment). We can try to get at that information via your partial unrolling approach, via traits, or via better inference. It's dissatisfying to introduce traits if we can solve the problem by being smarter with iteration in inference. It would be interesting to see where you got to with the partial unrolling approach. It looks like part of that is on the …
The … I also had another idea that might save this approach. Say we lower …
Just pushed the current state of my efforts to …
Great, thanks @martinholters! Glancing at the unrolling, I'm a little more optimistic about that approach. To use …
One inconvenience when wrapping the arguments in a function, whether it is …
As noted in #36087 and #29114, splatting integers currently has a performance penalty that is unexpected. For tuples and SimpleVectors, we have special purpose inliners that will simply inline the tuple/SimpleVector into the call being splatted. However, for everything else we'd have to run the iteration protocol to find out what the values to substitute are. This change does just that, limited to the case of length-1 (and empty) iterables.

Benchmark:

```
f(x) = (x...,)
@code_typed f(1)
@benchmark f(1)
```

Before:

```
julia> @code_typed f(1)
CodeInfo(
1 ─ %1 = Core._apply_iterate(Base.iterate, Core.tuple, x)::Tuple{Int64}
└── return %1
) => Tuple{Int64}

julia> @benchmark f(1)
BenchmarkTools.Trial:
  memory estimate:  32 bytes
  allocs estimate:  2
  --------------
  minimum time:     209.357 ns (0.00% GC)
  median time:      213.404 ns (0.00% GC)
  mean time:        218.674 ns (0.16% GC)
  maximum time:     1.922 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     540
```

After:

```
julia> @code_typed f(1)
CodeInfo(
1 ─ %1 = invoke Base.iterate(_2::Int64)::Tuple{Int64,Nothing}
│   %2 = (getfield)(%1, 1)::Int64
│   %3 = (getfield)(%1, 2)::Nothing
│        invoke Base.iterate(_2::Int64, %3::Nothing)::Nothing
│   %5 = Core.tuple(%2)::Tuple{Int64}
└── return %5
) => Tuple{Int64}

julia> @benchmark f(1)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.044 ns (0.00% GC)
  median time:      3.047 ns (0.00% GC)
  mean time:        3.049 ns (0.00% GC)
  maximum time:     7.700 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000
```

Obviously this isn't 100% optimal yet, because the `iterate` calls themselves don't get inlined, but it's a lot better. Inlining the `iterate` calls is left for a follow-up commit.
This supersedes #36169. Rather than re-implementing the iteration analysis as done there, this uses the new stmtinfo infrastructure to propagate all the analysis done during inference all the way to inlining. As a result, it applies not only to splats of singletons, but also to splats of any other short iterable that inference can analyze. E.g.:

```
f(x) = (x...,)
@code_typed f(1=>2)
@benchmark f(1=>2)
```

Before:

```
julia> @code_typed f(1=>2)
CodeInfo(
1 ─ %1 = Core._apply_iterate(Base.iterate, Core.tuple, x)::Tuple{Int64,Int64}
└── return %1
) => Tuple{Int64,Int64}

julia> @benchmark f(1=>2)
BenchmarkTools.Trial:
  memory estimate:  96 bytes
  allocs estimate:  3
  --------------
  minimum time:     242.659 ns (0.00% GC)
  median time:      246.904 ns (0.00% GC)
  mean time:        255.390 ns (1.08% GC)
  maximum time:     4.415 μs (93.94% GC)
  --------------
  samples:          10000
  evals/sample:     405
```

After:

```
julia> @code_typed f(1=>2)
CodeInfo(
1 ─ %1 = Base.getfield(x, 1)::Int64
│   %2 = Base.getfield(x, 2)::Int64
│   %3 = Core.tuple(%1, %2)::Tuple{Int64,Int64}
└── return %3
) => Tuple{Int64,Int64}

julia> @benchmark f(1=>2)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.701 ns (0.00% GC)
  median time:      1.925 ns (0.00% GC)
  mean time:        1.904 ns (0.00% GC)
  maximum time:     6.941 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000
```

I also implemented the TODO I had left in #36169 to inline the iterate calls themselves, which gives another 3x improvement over the solution in that PR:

```
julia> @code_typed f(1)
CodeInfo(
1 ─ %1 = Core.tuple(x)::Tuple{Int64}
└── return %1
) => Tuple{Int64}

julia> @benchmark f(1)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.696 ns (0.00% GC)
  median time:      1.699 ns (0.00% GC)
  mean time:        1.702 ns (0.00% GC)
  maximum time:     5.389 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000
```

Fixes #36087
Fixes #29114
In a few places, the performance of splatting a number has shown up (#29060, #28764 (comment)), and working around this gives significant speed-ups.
The code is well inferred, but there is a call to `_apply` instead of the actual method. This disappears if the number is wrapped in a tuple. It would be nice to have this fixed so these types of workarounds are no longer needed.
Ref #27434 (comment)
cc @mbauman, @martinholters
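A minimal reproduction of the issue and the tuple-wrapping workaround described above (timings vary by machine and Julia version; the rough magnitude matches the benchmarks quoted earlier in the thread):

```julia
using BenchmarkTools

f(x) = (x...,)       # splats the number directly -> hits the generic _apply path
g(x) = ((x,)...,)    # workaround: wrap in a tuple first -> tuple inliner applies

x = 1
@btime f($x)         # allocates and is much slower on affected versions
@btime g($x)         # essentially free
```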