-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speed up permsort by utilizing stability of the default sorting algorithm #47587
base: master
Are you sure you want to change the base?
Conversation
…roduces regressions.)
FIXES UNEXPECTED ALLOCATIONS removes code that previously harbored bugs that slipped through the test suite
Fixes a few remaining unexpected allocations U can be statically computed from the type of v and order so there is no need. Further, U is infered as ::DataType rather than Type{U} which causes type instabilities.
it is invalid to cache lenm1 because lo and hi may be redefined and we have no cache invalidation system
fixes JuliaLang#47474 in this PR rather than separate to avoid dealing with the merge
… sorting internals
make _sort! return scratch space rather than sorted vector so that things like IEEEFloatOptimization can re-use the scratch space allocated on their first recursive call
This is a great improvement but I think the naming is confusing: I would have guessed that |
Line 625 in 812c917
|
Thanks!
I think it is possible to get substantially larger speedups by computing sortperm with a ZipVector-like data structure, but so long as PermUnstable is not exported, this change shouldn't interfere with that possibility. Indeed it synergizes well for cases when a ZipVector is not appropriate (small arrays, perhaps?). |
@nanosoldier |
Your job failed. |
What about sizes between 10 and 40 when we currently perform IEEEFloatOptimizations and then automatically dispatch to InsertionSort? I imagine the allocations could be expensive in that regime. w.r.t nanosoldier, we know there is a problem, but we don't know what it is (#47795) |
Yes, there is a slight regression and we should address these allocations. (Btw, I would expect only a single allocation on master for julia> @btime sortperm(x) setup=(x=rand(12)); # master
531.895 ns (5 allocations: 368 bytes)
julia> @btime sortperm(x) setup=(x=rand(12)); # PR
574.639 ns (7 allocations: 688 bytes) |
There is some type instability in sortperm (separated from the sorting by function barriers) This accounts for 3 of the allocations. Also, the dispatch to InsertionSort for sizes smaller than 40 is only present for UIntMappable types and orders, and Perm orderings are not UIntMappable so we go through the new allocating QuickSort. This accounts for the 4th allocation, and one allocation is expected. Unfortunately, there is a substantial regression in 1.9 compared to 1.8 on small sortperm. This would be covered by BaseBenchmarks after JuliaCI/BaseBenchmarks.jl#305 & after it stops failing. I'll open an issue. |
I'll evaluate PR properly once allocation regression (#47811) is fixed to measure the influence of extra allocations and optimize it. |
@nanosoldier |
Your benchmark job has completed - successfully executed benchmarks. A full report can be found here. |
@nanosoldier |
Your benchmark job has completed - no performance regressions were detected. A full report can be found here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I benchmarked this with JuliaCI/BaseBenchmarks.jl#305 and found a minor regression for
@benchmark sort(x; by=x -> x isa Symbol ? (0, x) : (1, x)) setup=(x=[rand() < .5 ? randstring() : Symbol(randstring()) for _ in 1:30])
but several more substantial improvements to other benchmarks. It would be nice to know what causes the regression for non-concrete eltypes.
end | ||
end | ||
|
||
# This is similar to the partition function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to have code re-use; the loop here is almost exactly the same as the loops in partition!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice. However, partition!
left a pivot on the place and split to two parts with. Thus it seems challenging.
base/sort.jl
Outdated
Preserves the order of the elements. | ||
""" | ||
function send_to_end_stable!(f::F, v::AbstractVector; lo=firstindex(v), hi=lastindex(v)) where F <: Function | ||
tmp=copy(v) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally, this should utilize the scratch space re-use system, though it this PR is a performance improvement without, then that isn't essential.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've tried to re-use the scratch array, but not sure if i got your @getkw
concept right.
Co-authored-by: Lilith Orion Hafner <lilithhafner@gmail.com>
@@ -603,19 +640,26 @@ function _sort!(v::AbstractVector, a::IEEEFloatOptimization, o::Ordering, kw) | |||
j = send_to_end!(x -> after_zero(o, x), v; lo, hi) | |||
scratch = _sort!(iv, a.next, Reverse, (;kw..., lo, hi=j)) | |||
if scratch === nothing # Union split | |||
_sort!(iv, a.next, Forward, (;kw..., lo=j+1, hi, scratch)) | |||
_sort!(iv, a.next, Forward, (;kw..., lo=j+1, hi, nothing)) | |||
else | |||
_sort!(iv, a.next, Forward, (;kw..., lo=j+1, hi, scratch)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@LilithHafner Btw, is scratch
type-stable here, i.e., can compiler infer that scratch
cannot be Nothing
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recall that removing the if statement entirely introduced dynamic dispatch. IICU, all of type inference is an implementation detail, so I'm not totally sure, but I believe that we need an if statement to force union splitting though it works well whether we use _sort!(..., nothing)
or _sort!(..., scratch)
because the compiler can determine the type of scratch
at compile time..
@nanosoldier |
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. |
The more I think about it, the more I'm convinced the whole julia> x = rand(10000000);
julia> sp(x) = sort(collect(a for a in zip(x, eachindex(x))), by = x -> x[1]) .|> x -> x[2]
sp (generic function with 1 method)
julia> @time sortperm(x); # This PR, already compiled
1.389301 seconds (7 allocations: 152.588 MiB, 1.04% gc time)
julia> @time sp(x);
1.182990 seconds (8 allocations: 534.058 MiB, 1.73% gc time)
julia> sp(x) == sortperm(x)
true |
+1 for
|
This is a draft on how to speed up
permsort
. The idea is that it is not necessary to maintain the stability byPerm
if a stable sorting algorithm is used. ThusPermUnstable
is introduces with a simplify and fasterlt(p::PermUnstable, a::Integer, b::Integer)
implementation.Notice the PR builds upon #47383 that has not been merged yet.
@LilithHafner Do you like this idea (as the original author of the changes)?