Performance of splatting a number #29114
Comments
We could special case numbers as we do tuples in the inliner. The more generic solution would have to emulate the iteration protocol.
Forgive me if this is a naive understanding of inference, but it seems like there's a middle ground here: isn't it fairly easy to see that …
Yes, but that's still emulating the iteration protocol in the optimizer (inference already does some of this), and the `iterate` isn't part of the code (since `_apply` does it internally). Basically you need to implement this TODO in `base/compiler/ssair/inlining.jl` (line 831 at 3c8119f), or special-case it there.
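For reference, the iteration protocol that would have to be emulated is trivial for numbers: `Base` defines `iterate` on numbers so that they behave as one-element collections. A quick sketch:

```julia
iterate(1)           # (1, nothing): the single element plus the iteration state
iterate(1, nothing)  # nothing: iteration is exhausted after one step
```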
I just noticed we share basically the same problem over at JuliaArrays/StaticArrays.jl#361. What if we change the lowering of splatting so that it goes through definitions like

```julia
Base.splatarg(x::Number) = (x,)
Base.splatarg(v::StaticVector) = Tuple(v)
```

This does mean that splatting would become configurable by user types, but that seems generally consistent with the way we usually lower syntax into function calls.
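A rough sketch of how that lowering could behave (`splatarg` is only the name proposed above, not part of Base, and the lowering shown in the comments is an assumption):

```julia
# Hypothetical: the default method passes the iterable through untouched,
# while a Number is turned into a length-1 tuple that the existing
# tuple inliner already handles well.
splatarg(x) = x
splatarg(x::Number) = (x,)
# a package could add, e.g., splatarg(v::StaticVector) = Tuple(v)

# `g(a, xs...)` would then lower to roughly `g(a, splatarg(xs)...)`:
g(args...) = +(args...)
g(1, splatarg(2)...)        # 3 -- the splatted number is now a tuple
g(1, splatarg((2, 3))...)   # 6 -- tuples pass through unchanged
```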
Nice idea. Maybe it can be …
Interesting analogy; we could go with it, though I think the name is less evocative than @Keno's point about making inference aware of iteration in …
Haha... no, I think I believe in strong interfaces, and in guarantees the compiler doesn't break for the purpose of performance optimizations. Using … So, to my personal taste, we would only resort to that if Keno's approach is too messy. But my opinion isn't strong.
Yes, something which can be overridden only really makes sense if there's a compelling reason to allow splatting to be different from iteration in some cases. Otherwise it's just a nasty performance hack. My question was really whether such a use case exists. In hindsight that's actually rather orthogonal to solving this issue.

Another thought: I don't know how hard it is to teach inference about iteration in `_apply`, but we could lower `f(x, y..., z...)` into something like

```julia
args = flatten_args((x,), y, z)
_apply(f, args)
```

where …
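A minimal sketch of what such a `flatten_args` could look like as an ordinary generic function (only the name comes from the comment above; the definition is an assumption, and a real lowering could of course not itself rely on splatting syntax):

```julia
# Accumulate already-flattened arguments in a tuple and append each
# remaining splatted collection in turn.
flatten_args(acc::Tuple) = acc
flatten_args(acc::Tuple, x, rest...) = flatten_args((acc..., x...), rest...)

flatten_args((1,), (2, 3), 4:5)   # (1, 2, 3, 4, 5)
```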
OK - that seems pretty interesting!
It's currently tricky to surface the C implementation of `_apply` argument flattening in a useful way. Ideally we'd want something roughly like …

One approach to implementing "when possible" would be to generalize the … Doing more in inference to constant-propagate the iteration state for small finite iterations seems helpful, and we might get rid of …

```julia
julia> function iter_unpack(t)
           i1 = iterate(t)
           a = i1[1]
           i2 = iterate(t, i1[2])
           b = i2[1]
           b
       end
iter_unpack (generic function with 1 method)

julia> @code_warntype optimize=true iter_unpack((1,2))
Variables
  #self#::Core.Compiler.Const(iter_unpack, false)
  t::Tuple{Int64,Int64}
  i1::Union{Nothing, Tuple{Int64,Int64}}
  a::Int64
  i2::Union{Nothing, Tuple{Int64,Int64}}
  b::Int64
# big pile of code ...
# i1 and i2 are partially const but that seems not to be inferred
```
If we are going to add an API to manipulate splatting behavior, it would be great if it could also be dispatched on from the function side. For example, this lets you dispatch …
@tkf that's a very interesting idea, to get actual syntax for reductions:

```julia
*(xs...)            # ≡ prod(xs)
+(xs...)            # ≡ sum(xs)
min(xs...)          # ≡ minimum(xs)
max(xs...)          # ≡ maximum(xs)
+(abs.(xs)...)      # ≡ mapreduce(abs, +, xs)
+(xs.^2...)         # ≡ mapreduce(x->x^2, +, xs)
```

That last one is particularly appealing syntax. One problem is that it would mean defining a lot of …
I think there are multiple solutions. For example:

**Abstract-type based solution**

```julia
abstract type AssociativeOperator <: Function end

(op::AssociativeOperator)(xs::GenericVararg{<:Union{AbstractArray, Broadcasted}}) =
    reduce(op, xs)

struct Plus <: AssociativeOperator end
struct Multiply <: AssociativeOperator end
struct Vcat <: AssociativeOperator end

const + = Plus()
const * = Multiply()
const vcat = Vcat()

+(x, y) = ...
*(x, y) = ...
vcat(x, y) = ...
```

**Different lowering**

Alternatively, you can lower splatted calls into something like `splatapply(f, xs)` and define

```julia
const AssociativeOperator = Union{
    typeof(*),
    typeof(+),
    typeof(vcat),
    ...and so on...
}

splatapply(f::AssociativeOperator, xs::GenericVararg{<:Union{AbstractArray, Broadcasted}}) =
    reduce(f, xs)
```
Let me repeat the main points from #27434 (comment): Inference already simulates the iteration protocol (to some degree); however, that is solely used to determine the return type of the `_apply` call:

```julia
julia> foo(x) = (x...,)
foo (generic function with 1 method)

julia> @code_typed foo(1)
CodeInfo(
1 ─ %1 = Core._apply(Core.tuple, x)::Tuple{Int64}
└── return %1
) => Tuple{Int64}
```

As you can see, inference correctly determines that splatting an `Int` gives a `Tuple{Int64}`. That said, inference does not currently propagate constants during the iteration emulation, so that e.g.

```julia
julia> @code_typed foo(@SVector [1,2,3])
CodeInfo(
1 ─ %1 = Core._apply(Core.tuple, x)::Tuple{Int64,Vararg{Int64,N} where N}
└── return %1
) => Tuple{Int64,Vararg{Int64,N} where N}
```

That might be fixable; however, recognizing that it is actually one of the elementary types (here a …
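For comparison, the usual workaround today is to convert to a tuple before splatting, which hands the optimizer one of the types it already special-cases; a sketch assuming StaticArrays is loaded:

```julia
using StaticArrays

foo(x) = (x...,)          # splats via the generic _apply path
bar(x) = (Tuple(x)...,)   # explicit conversion first, then a tuple splat

# @code_typed bar(@SVector [1, 2, 3]) should show a concretely inferred
# NTuple{3,Int64} result, unlike the Vararg-typed result for foo above.
```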
Regarding the more general issue, I did some experimenting I also want to share. I started by assuming … I got this working (mostly) with modest effort, and my conclusions are: excellent for cases with an inferable (= statically known size) result of …; I have a feeling this is a dead end due to the unavoidable penalties in the size-not-statically-known case, but maybe someone has a good idea...
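For readers following along, "partial unrolling" here means expanding the first few `iterate` calls by hand so inference can fold them away for short iterables; a very rough sketch (the names and the cut-off are assumptions, not the actual experiment):

```julia
# Unroll up to two elements explicitly; for anything longer (or of
# statically unknown length) fall back to collecting the remainder,
# which is where the unavoidable penalty mentioned above comes from.
function unrolled_splat_call(f, it)
    s1 = iterate(it)
    s1 === nothing && return f()
    x1, st1 = s1
    s2 = iterate(it, st1)
    s2 === nothing && return f(x1)
    x2, st2 = s2
    iterate(it, st2) === nothing && return f(x1, x2)
    return f(x1, x2, collect(Iterators.rest(it, st2))...)   # slow fallback
end

unrolled_splat_call(+, 1)          # 1   (splatting a number)
unrolled_splat_call(tuple, 1:4)    # (1, 2, 3, 4)
```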
Right, the difficulty is all about the flattening "when possible" referred to in #29114 (comment). We can try to get at that information via your partial unrolling approach, via traits, or via better inference. It's dissatisfying to introduce traits if we can solve the problem by being smarter with iteration in inference. It would be interesting to see where you got to with the partial unrolling approach. It looks like part of that is on the …
The … I also had another idea that might save this approach. Say we lower …
Just pushed the current state of my efforts to …
Great, thanks @martinholters! Glancing at the unrolling, I'm a little more optimistic about that approach. To use …
One inconvenience when wrapping the arguments in a function, whether it is …
As noted in #36087 and #29114, splatting integers currently has a performance penalty that is unexpected. For tuples and SimpleVectors, we have special purpose inliners that will simply inline the tuple/SimpleVector into the call being splatted. However, for everything else we'd have to run the iteration protocol to find out what the values to substitute are. This change does just that, limited to the case of length-1 (and empty) iterables.

Benchmark:

```
f(x) = (x...,)
@code_typed f(1)
@benchmark f(1)
```

Before:

```
julia> @code_typed f(1)
CodeInfo(
1 ─ %1 = Core._apply_iterate(Base.iterate, Core.tuple, x)::Tuple{Int64}
└── return %1
) => Tuple{Int64}

julia> @benchmark f(1)
BenchmarkTools.Trial:
  memory estimate:  32 bytes
  allocs estimate:  2
  --------------
  minimum time:     209.357 ns (0.00% GC)
  median time:      213.404 ns (0.00% GC)
  mean time:        218.674 ns (0.16% GC)
  maximum time:     1.922 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     540
```

After:

```
julia> @code_typed f(1)
CodeInfo(
1 ─ %1 = invoke Base.iterate(_2::Int64)::Tuple{Int64,Nothing}
│   %2 = (getfield)(%1, 1)::Int64
│   %3 = (getfield)(%1, 2)::Nothing
│        invoke Base.iterate(_2::Int64, %3::Nothing)::Nothing
│   %5 = Core.tuple(%2)::Tuple{Int64}
└── return %5
) => Tuple{Int64}

julia> @benchmark f(1)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.044 ns (0.00% GC)
  median time:      3.047 ns (0.00% GC)
  mean time:        3.049 ns (0.00% GC)
  maximum time:     7.700 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000
```

Obviously this isn't 100% optimal yet, because the `iterate` calls themselves don't get inlined, but it's a lot better. Inlining the `iterate` calls is left for a follow-up commit.
This supersedes #36169. Rather than re-implementing the iteration analysis as done there, this uses the new stmtinfo infrastructure to propagate all the analysis done during inference all the way to inlining. As a result, it applies not only to splats of singletons, but also to splats of any other short iterable that inference can analyze. E.g.:

```
f(x) = (x...,)
@code_typed f(1=>2)
@benchmark f(1=>2)
```

Before:

```
julia> @code_typed f(1=>2)
CodeInfo(
1 ─ %1 = Core._apply_iterate(Base.iterate, Core.tuple, x)::Tuple{Int64,Int64}
└── return %1
) => Tuple{Int64,Int64}

julia> @benchmark f(1=>2)
BenchmarkTools.Trial:
  memory estimate:  96 bytes
  allocs estimate:  3
  --------------
  minimum time:     242.659 ns (0.00% GC)
  median time:      246.904 ns (0.00% GC)
  mean time:        255.390 ns (1.08% GC)
  maximum time:     4.415 μs (93.94% GC)
  --------------
  samples:          10000
  evals/sample:     405
```

After:

```
julia> @code_typed f(1=>2)
CodeInfo(
1 ─ %1 = Base.getfield(x, 1)::Int64
│   %2 = Base.getfield(x, 2)::Int64
│   %3 = Core.tuple(%1, %2)::Tuple{Int64,Int64}
└── return %3
) => Tuple{Int64,Int64}

julia> @benchmark f(1=>2)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.701 ns (0.00% GC)
  median time:      1.925 ns (0.00% GC)
  mean time:        1.904 ns (0.00% GC)
  maximum time:     6.941 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000
```

I also implemented the TODO I had left in #36169 to inline the iterate calls themselves, which gives another 3x improvement over the solution in that PR:

```
julia> @code_typed f(1)
CodeInfo(
1 ─ %1 = Core.tuple(x)::Tuple{Int64}
└── return %1
) => Tuple{Int64}

julia> @benchmark f(1)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.696 ns (0.00% GC)
  median time:      1.699 ns (0.00% GC)
  mean time:        1.702 ns (0.00% GC)
  maximum time:     5.389 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000
```

Fixes #36087
Fixes #29114
In a few places, the performance of splatting a number has shown up (#29060, #28764 (comment)), and working around this gives significant speed-ups.
The code is well inferred, but there is a call to `_apply` instead of the actual method. This disappears if the number is wrapped in a tuple. It would be nice to have this fixed so these types of workarounds are no longer needed.
Ref #27434 (comment)
cc @mbauman, @martinholters
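A minimal reproduction of the issue and the tuple-wrapping workaround described above (timings vary by machine and Julia version; the rough magnitude matches the benchmarks quoted earlier in the thread):

```julia
using BenchmarkTools

f(x) = (x...,)       # splats the number directly -> hits the generic _apply path
g(x) = ((x,)...,)    # workaround: wrap in a tuple first -> tuple inliner applies

x = 1
@btime f($x)         # allocates and is much slower on affected versions
@btime g($x)         # essentially free
```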