Speed up permsort by utilizing stability of the default sorting algorithm #47587

Draft
wants to merge 41 commits into
base: master

Conversation

petvana
Member

@petvana petvana commented Nov 15, 2022

This is a draft on how to speed up sortperm. The idea is that Perm does not need to enforce stability itself when a stable sorting algorithm is used. Thus PermUnstable is introduced, with a simpler and faster lt(p::PermUnstable, a::Integer, b::Integer) implementation.

Notice the PR builds upon #47383 that has not been merged yet.

@LilithHafner Do you like this idea (as the original author of the changes)?

julia> using BenchmarkTools

#47383
julia> @btime sortperm(x) setup=(x=rand(1000));
  29.741 μs (5 allocations: 15.92 KiB)

# This PR
julia> @btime sortperm(x) setup=(x=rand(1000));
  23.349 μs (5 allocations: 15.92 KiB)
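
A minimal sketch of the idea, not the PR's actual code: `NoTiePerm` is a hypothetical ordering that compares two indices only by the data they point to and omits the index tie-break that `Base.Order.Perm` performs. Paired with a stable algorithm (here `MergeSort`), equal elements are expected to keep their ascending index order anyway.

```julia
using Base.Order: Ordering, Forward

# Hypothetical ordering for illustration only (not the PR's PermUnstable).
# It orders indices by the data they refer to, with no tie-break on the index.
struct NoTiePerm{O<:Ordering,V<:AbstractVector} <: Ordering
    order::O
    data::V
end

Base.Order.lt(p::NoTiePerm, a::Integer, b::Integer) =
    Base.Order.lt(p.order, p.data[a], p.data[b])

x = [3, 1, 2, 1, 3, 2]

# Sort the indices with a stable algorithm; ties stay in index order because
# the algorithm itself never reorders elements that compare as equal.
p = sort!(collect(eachindex(x)); alg=MergeSort, order=NoTiePerm(Forward, x))

p == sortperm(x)
```

The saving comes from `lt` doing one comparison of the data instead of a comparison plus an equality check and an index comparison.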

Lilith Hafner and others added 22 commits November 8, 2022 15:04
FIXES UNEXPECTED ALLOCATIONS
removes code that previously harbored bugs that slipped through the test suite
Fixes a few remaining unexpected allocations
U can be statically computed from the type of v and order so there is no need.
Further, U is inferred as ::DataType rather than Type{U} which causes type instabilities.
it is invalid to cache lenm1 because lo and hi may be redefined
and we have no cache invalidation system
fixes JuliaLang#47474
in this PR rather than separate to avoid dealing with the merge
make _sort! return scratch space rather than sorted vector
so that things like IEEEFloatOptimization can re-use the
scratch space allocated on their first recursive call
@petvana petvana added performance Must go faster sorting Put things in order labels Nov 15, 2022
@StefanKarpinski
Member

This is a great improvement but I think the naming is confusing: I would have guessed that PermUnstable would be used together with an unstable sorting algorithm, not the other way around.

@petvana
Member Author

petvana commented Nov 16, 2022

PermUnstable is really a bad name. I can imagine something like FastPerm or PermFast, inspired by fast math. Currently, the CI fails, and I would blame the reverse ordering in IEEEFloatOptimization:

scratch = _sort!(v, a.next, PermT(Reverse, ip), (;kw..., lo, hi=j))

@LilithHafner
Member

Thanks! sortperm has always been a weak link in Julia's sorting, performance-wise (#939).

sortperm and sortperm! allow the user to specify an algorithm, and the docstring states "The permutation is guaranteed to be stable even if the sorting algorithm is unstable, meaning that indices of equal elements appear in ascending order." While Base only defines stable algorithms, some packages (e.g. SortingAlgorithms.jl) export unstable algorithms. We could either:

  1. Only use the non-stabilizing Perm type for algorithms we recognize, optionally define a trait to let external algorithms opt into declaring stability, and support fast paths in e.g. IEEEFloatOptimization for both Perm and PermUnstable.
  2. Use PkgEval to determine that this PR is okay as a "minor change". This is dicey because stability is subtle and may break things without being detected by tests.

I think it is possible to get substantially larger speedups by computing sortperm with a ZipVector-like data structure, but so long as PermUnstable is not exported, this change shouldn't interfere with that possibility. Indeed it synergizes well for cases when a ZipVector is not appropriate (small arrays, perhaps?).
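
Option 1 could look roughly like the sketch below. Everything here is an assumption made for illustration: `is_stable`, `index_order`, `NoTiePerm`, and `OpaqueAlg` are hypothetical names and are not part of Base or of this PR. The trait defaults to `false`, so an unknown algorithm keeps the stabilizing `Perm` ordering.

```julia
using Base.Order: Ordering, Forward, Perm
using Base.Sort: Algorithm

# Hypothetical non-stabilizing ordering (stands in for the PR's PermUnstable).
struct NoTiePerm{O<:Ordering,V<:AbstractVector} <: Ordering
    order::O
    data::V
end
Base.Order.lt(p::NoTiePerm, a::Integer, b::Integer) =
    Base.Order.lt(p.order, p.data[a], p.data[b])

# Hypothetical opt-in trait: conservative default, explicit opt-in per algorithm.
is_stable(::Algorithm) = false
is_stable(::typeof(MergeSort)) = true
is_stable(::typeof(InsertionSort)) = true

# Pick the cheap ordering only when the algorithm has declared itself stable.
index_order(alg::Algorithm, order::Ordering, data::AbstractVector) =
    is_stable(alg) ? NoTiePerm(order, data) : Perm(order, data)

# An external algorithm that has not opted in falls back to Perm.
struct OpaqueAlg <: Algorithm end

index_order(MergeSort, Forward, [2, 1]) isa NoTiePerm
index_order(OpaqueAlg(), Forward, [2, 1]) isa Perm
```

A package exporting a stable algorithm would opt in by adding one `is_stable` method for its algorithm type.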

@petvana
Member Author

petvana commented Dec 6, 2022

@nanosoldier runbenchmarks("sort")

@petvana petvana marked this pull request as ready for review December 6, 2022 08:34
@nanosoldier
Collaborator

Your job failed.

@petvana petvana changed the title WIP: Speed up permsort by utilizing stability of the default sorting algorithm Speed up permsort by utilizing stability of the default sorting algorithm Dec 6, 2022
@LilithHafner
Member

What about sizes between 10 and 40 when we currently perform IEEEFloatOptimizations and then automatically dispatch to InsertionSort? I imagine the allocations could be expensive in that regime.

W.r.t. nanosoldier: we know there is a problem, but we don't know what it is (#47795).

@petvana
Member Author

petvana commented Dec 6, 2022

> What about sizes between 10 and 40

Yes, there is a slight regression and we should address these allocations. (Btw, I would expect only a single allocation on master for sortperm and none for sortperm! if in-place InsertionSort is used.)

julia> @btime sortperm(x) setup=(x=rand(12)); # master
  531.895 ns (5 allocations: 368 bytes)

julia> @btime sortperm(x) setup=(x=rand(12)); # PR
  574.639 ns (7 allocations: 688 bytes)

@LilithHafner
Member

LilithHafner commented Dec 6, 2022

> I would expect only a single allocation on master for sortperm and none for sortperm! if in-place InsertionSort is used.

There is some type instability in sortperm (separated from the sorting by function barriers). This accounts for 3 of the allocations. Also, the dispatch to InsertionSort for sizes smaller than 40 is only present for UIntMappable types and orders, and Perm orderings are not UIntMappable, so we go through the new allocating QuickSort. This accounts for the 4th allocation, and one allocation is expected. Unfortunately, there is a substantial regression in 1.9 compared to 1.8 on small sortperm. This would be covered by BaseBenchmarks after JuliaCI/BaseBenchmarks.jl#305 and after it stops failing. I'll open an issue.
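
For readers unfamiliar with the term, a function barrier can be sketched as follows. The names `unstable_setup` and `kernel` are hypothetical and unrelated to the actual sortperm code; the point is only that the type-unstable part stays in the outer function while the hot loop runs in a function compiled for the concrete argument type.

```julia
# The inner function is compiled separately for each concrete type of `keys`,
# so its loop is type-stable even when the caller is not.
function kernel(keys::AbstractVector)
    s = zero(eltype(keys))
    for k in keys
        s += k
    end
    return s
end

# Type-unstable outer function: the element type of `keys` depends on a
# runtime value, so the call to `kernel` is a dynamic dispatch (the barrier).
function unstable_setup(x::Vector{Int}, use_float::Bool)
    keys = use_float ? float.(x) : x
    return kernel(keys)
end

unstable_setup([1, 2, 3], true)   # runs kernel(::Vector{Float64})
unstable_setup([1, 2, 3], false)  # runs kernel(::Vector{Int})
```

The dynamic dispatch at the barrier is cheap relative to the loop for large inputs, but it is the kind of fixed overhead that can show up as extra allocations on very small arrays.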

@petvana
Member Author

petvana commented Dec 6, 2022

I'll evaluate the PR properly once the allocation regression (#47811) is fixed, so that I can measure the influence of the extra allocations and optimize them.

@petvana petvana marked this pull request as draft December 6, 2022 17:30
@petvana
Member Author

petvana commented Dec 13, 2022

@nanosoldier runbenchmarks("sort")

@nanosoldier
Collaborator

Your benchmark job has completed - successfully executed benchmarks. A full report can be found here.

@petvana
Member Author

petvana commented Dec 13, 2022

@nanosoldier runbenchmarks("sort", vs = ":master")

@nanosoldier
Collaborator

Your benchmark job has completed - no performance regressions were detected. A full report can be found here.

Member

@LilithHafner LilithHafner left a comment

I benchmarked this with JuliaCI/BaseBenchmarks.jl#305 and found a minor regression for

@benchmark sort(x; by=x -> x isa Symbol ? (0, x) : (1, x)) setup=(x=[rand() < .5 ? randstring() : Symbol(randstring()) for _ in 1:30])

but several more substantial improvements to other benchmarks. It would be nice to know what causes the regression for non-concrete eltypes.

end
end

# This is similar to the partition function
Member

It would be nice to have code re-use; the loop here is almost exactly the same as the loops in partition!

Member Author

It would be nice. However, partition! leaves the pivot in place and splits the remaining elements into two parts around it, so sharing the loop seems challenging.

base/sort.jl Outdated
Preserves the order of the elements.
"""
function send_to_end_stable!(f::F, v::AbstractVector; lo=firstindex(v), hi=lastindex(v)) where F <: Function
tmp=copy(v)
Member

Ideally, this should utilize the scratch space re-use system, though if this PR is a performance improvement without it, then that isn't essential.

Member Author

I've tried to re-use the scratch array, but I'm not sure if I got your @getkw concept right.
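
For context, a stable "send to end" step can be sketched as below. This is a hypothetical reimplementation written for illustration, not the PR's `send_to_end_stable!`: the name `stable_send_to_end!` and the choice to return the last index of the kept block are assumptions, and the sketch allocates its own temporary copy rather than using the scratch-space system discussed above.

```julia
# Moves every element of v[lo:hi] for which f returns true to the end of that
# range, preserving the relative order within both groups. Returns the index
# of the last element for which f is false (lo - 1 if there is none).
function stable_send_to_end!(f, v::AbstractVector; lo=firstindex(v), hi=lastindex(v))
    tmp = v[lo:hi]            # temporary copy of the range being rearranged
    i = lo
    for x in tmp              # first pass: elements that stay at the front
        if !f(x)
            v[i] = x
            i += 1
        end
    end
    j = i - 1
    for x in tmp              # second pass: elements sent to the end
        if f(x)
            v[i] = x
            i += 1
        end
    end
    return j
end

v = [1, -2, 3, -4, 5]
j = stable_send_to_end!(<(0), v)
# v is now [1, 3, 5, -2, -4] and j is 3
```

Two passes over a copy keep the code simple at the cost of one allocation per call, which is the allocation the review comment suggests avoiding.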

petvana and others added 3 commits December 14, 2022 09:21
@@ -603,19 +640,26 @@ function _sort!(v::AbstractVector, a::IEEEFloatOptimization, o::Ordering, kw)
j = send_to_end!(x -> after_zero(o, x), v; lo, hi)
scratch = _sort!(iv, a.next, Reverse, (;kw..., lo, hi=j))
if scratch === nothing # Union split
_sort!(iv, a.next, Forward, (;kw..., lo=j+1, hi, scratch))
_sort!(iv, a.next, Forward, (;kw..., lo=j+1, hi, nothing))
else
_sort!(iv, a.next, Forward, (;kw..., lo=j+1, hi, scratch))
Member Author

@LilithHafner Btw, is scratch type-stable here, i.e., can the compiler infer that scratch cannot be nothing?

Member

I recall that removing the if statement entirely introduced dynamic dispatch. IIUC, all of type inference is an implementation detail, so I'm not totally sure, but I believe that we need an if statement to force union splitting. Inside the branches it works well whether we use _sort!(..., nothing) or _sort!(..., scratch), because the compiler can determine the type of scratch at compile time.
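
The branching pattern under discussion can be sketched in isolation. The functions `step!` and `two_steps!` are hypothetical stand-ins, not the real `_sort!` methods; whether the compiler actually union-splits here is an implementation detail, as noted above.

```julia
# Hypothetical stand-in for a sorting step: returns `nothing` when it needed
# no scratch space, otherwise returns the scratch vector it used or allocated.
function step!(v::Vector{Int}, scratch::Union{Nothing,Vector{Int}})
    length(v) < 4 && return nothing
    return scratch === nothing ? similar(v) : scratch
end

function two_steps!(v::Vector{Int})
    scratch = step!(v, nothing)   # inferred as Union{Nothing, Vector{Int}}
    # Branching on `=== nothing` narrows the type of `scratch` in each arm,
    # so each call below sees a concrete argument type.
    if scratch === nothing
        return step!(v, nothing)
    else
        return step!(v, scratch)
    end
end

two_steps!([1, 2, 3])        # short input: no scratch space involved
two_steps!(collect(1:10))    # longer input: scratch from the first call is reused
```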

@petvana
Member Author

petvana commented Dec 14, 2022

@nanosoldier runbenchmarks("sort", vs = ":master")

@nanosoldier
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

@petvana
Member Author

petvana commented Dec 15, 2022

The more I think about it, the more I'm convinced the whole Perm concept is inefficient because it generates an extreme number of cache misses. For large arrays it is easily beaten by the one-liner sp below (without float-specific optimizations, I believe).

julia> x = rand(10000000);

julia> sp(x) = sort(collect(a for a in zip(x, eachindex(x))), by = x -> x[1]) .|> x -> x[2]
sp (generic function with 1 method)

julia> @time sortperm(x); # This PR, already compiled
  1.389301 seconds (7 allocations: 152.588 MiB, 1.04% gc time)

julia> @time sp(x);
  1.182990 seconds (8 allocations: 534.058 MiB, 1.73% gc time)

julia> sp(x) == sortperm(x)
true
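
A variant of the same idea, offered as a sketch rather than a proposal for Base: sorting the (value, index) pairs with the default tuple comparison breaks ties on the index, so the resulting permutation is expected to be stable regardless of the algorithm's own stability. The name `zip_sortperm` is hypothetical, and this is not the ZipVector design mentioned earlier in the thread.

```julia
# Pair each value with its index, sort the pairs lexicographically
# (value first, index second), then keep only the indices.
zip_sortperm(x::AbstractVector) = last.(sort!(collect(zip(x, eachindex(x)))))

x = [0.3, 0.1, 0.3, 0.2]
zip_sortperm(x) == sortperm(x)
```

The pairs sit contiguously in memory, so each comparison reads adjacent data instead of following an index into `x`, which is the cache behavior the comment above is concerned with. The cost is the larger temporary array seen in the timings.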

@LilithHafner
Member

+1 for

> The more I think about it, the more I'm convinced the whole Perm concept is inefficient

4 participants