Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add radix sort #44230

Merged
merged 103 commits into from
Apr 3, 2022
Merged

Add radix sort #44230

merged 103 commits into from
Apr 3, 2022

Conversation

LilithHafner
Copy link
Member

@LilithHafner LilithHafner commented Feb 17, 2022

I've implemented radix sort and integrated it into the Julia sorting architecture. Micro-benchmarking reports a substantial speedup in many cases and minor regressions in a few cases. I hope that with a bit of collaboration, we can minimize those regressions as well.

This grows out of a discussion on discourse and a PR to SortingAlgorithm.jl. It is inspired in part by the radix sort algorithm in SortingAlgorithms.jl. This is different from the radix sort algorithm in SortingAlgorithms.jl because it is substantially faster in most cases and more easily extensible.

This PR is based on a fast radix sort for unsigned integers in increasing order. To sort other types and orders—currently all signed and unsigned integers, Float32, Float64, and Char, for Forward and Reverse orderings are supported—we have a function Sort.Serial.serialize that converts objects to unsigned integers which are sorted and then the original objects are reconstructed via Sort.Serial.deserialize. These two functions come with a trait Sort.Serial.Serializable that indicates whether a type is serializable for a given ordering. If a package defines a custom type that can be sorted via radix sort, it may either a) do nothing and we will continue to dispatch to a comparison sort at compile time or b) define these three functions and we will dispatch to radix sort when appropriate.

I have left out many features and minor optimizations that have come up in discussion surrounding this algorithm, preferring simplicity over minor performance benefits. Those optimizations can be added now, later, or never.

A small sample of benchmarking vs. mar 1 master (3a5dc09)

@benchmark sort!(x) setup=(x=rand(1:1, 1)) evals=1 # 50%, 48% speedup
@benchmark sort!(x) setup=(x=rand(1:30, 30)) evals=1 # 36%, 28% speedup
@benchmark sort!(x) setup=(x=rand(1:1000, 1000)) evals=1 # 479%, 450% speedup
@benchmark sort!(x) setup=(x=rand(1:30000, 30000)) evals=1 # 635%, 793% speedup
@benchmark sort!(x) setup=(x=rand(1:1000000, 1000000)) evals=1 # 648%, 642% speedup

@benchmark sort!(x) setup=(x=rand(1)) evals=1 # 71%, 17% speedup
@benchmark sort!(x) setup=(x=rand(10)) evals=1 # 6%, 9% speedup
@benchmark sort!(x) setup=(x=rand(30)) evals=1 # 54%, 14% speedup
@benchmark sort!(x) setup=(x=rand(1000)) evals=1 # 76%, 73% speedup
@benchmark sort!(x) setup=(x=rand(30000)) evals=1 # 121%, 119% speedup
@benchmark sort!(x) setup=(x=rand(1000000)) evals=1 # 139%, 144% speedup

@benchmark sort!(x) setup=(x=rand(Char, 1)) evals=1 # 75%, 155% speedup
@benchmark sort!(x) setup=(x=rand(Char, 30)) evals=1 # 32%, 86% speedup
@benchmark sort!(x) setup=(x=rand(Char, 1000)) evals=1 # 272%, 284% speedup
@benchmark sort!(x) setup=(x=rand(Char, 30000)) evals=1 # 771%, 873% speedup
@benchmark sort!(x) setup=(x=rand(Char, 1000000)) evals=1 # 1002%, 965% speedup


@benchmark sort!(x) setup=(x=rand(Bool, 10^6)) evals=1 # 1843%, 2123% speedup
@benchmark sort!(x) setup=(x=rand(Int8, 10^6)) evals=1 # 5368%, 5272% speedup
@benchmark sort!(x) setup=(x=rand(UInt8, 10^6)) evals=1 # 5275%, 5162% speedup

@benchmark sort!(x) setup=(x=rand(Int16, 10^6)) evals=1 # 2204%, 2165% speedup
@benchmark sort!(x) setup=(x=rand(UInt16, 10^6)) evals=1 # 2197%, 2140% speedup
@benchmark sort!(x) setup=(x=rand(Float16, 10^6)) evals=1 # 2%, -1% speedup

@benchmark sort!(x) setup=(x=rand(Int32, 10^6)) evals=1 # 567%, 548% speedup
@benchmark sort!(x) setup=(x=rand(UInt32, 10^6)) evals=1 # 524%, 525% speedup
@benchmark sort!(x) setup=(x=rand(Char, 10^6)) evals=1 # 1001%, 944% speedup
@benchmark sort!(x) setup=(x=rand(Float32, 10^6)) evals=1 # 491%, 464% speedup

@benchmark sort!(x) setup=(x=rand(Int64, 10^6)) evals=1 # 120%, 118% speedup
@benchmark sort!(x) setup=(x=rand(UInt64, 10^6)) evals=1 # 104%, 106% speedup
@benchmark sort!(x) setup=(x=rand(Float64, 10^6)) evals=1 # 141%, 142% speedup

@benchmark sort!(x) setup=(x=rand(Int128, 10^6)) evals=1 # 12%, 6% speedup
@benchmark sort!(x) setup=(x=rand(UInt128, 10^6)) evals=1 # 1%, -2% speedup

(run 2 times, speedups computed by dividing minimum time on master by minimum time on ffcc043)

Problems & todos

  • @benchmark sort!(x) setup=(x=rand(1000)) is 300ns slower for radix sort than quick sort.
    It should immediately dispatch to quick sort based only on length, so it is unclear why there
    is such a substantial slowdown. This is probably related to the 100% slowdown for sorting
    single floats. (addressed in Add radix sort #44230 (comment))

  • I've run millions of tests with SortMark.jl, but this still needs efficient comprehensive
    testing in Base, and more hand crafted tests may still find bugs.

  • Sort.Serial is a bit of a name conflict with the existing package Serialization

  • Perhaps we should support passing in the temporary array. Is that performance feature used anywhere?

  • The heuristics right now are pretty primitive. Perhaps they can be improved or tuned.

  • Potentially rework integration with fpsort!

  • Perhaps we should integrate this with sortperm, right now it is incompatible, but with some changes, it could work.

base/sort.jl Outdated Show resolved Hide resolved
@oscardssmith oscardssmith added the performance Must go faster label Feb 17, 2022
@LilithHafner
Copy link
Member Author

LilithHafner commented Feb 17, 2022

The build error is LoadError(at "compiler/compiler.jl" line 3: LoadError(at "sort.jl" line 672: UndefVarError(var=:Val))). Is this because Val is not available until after sort.jl in bootstrapping? Is it advisable to introduce RadixSort later on (I doubt we sort large arrays during bootstrapping)? (EDIT: resolved. This is my first time making a substantive addition to Base, and I forgot to add using ..Base: everything, i, use, from, base, even, if, exported. Val is defined very early in essentials.jl.)

@quinnj
Copy link
Member

quinnj commented Feb 17, 2022

Yeah, this is exciting. I'll link to my own implementation that lives in the InlineStrings.jl package.

Some notable changes I made to the SortingAlgorithms.jl implementation:

  • It operates on any "primitive" type, so no specific restrictions on integer vs float vs anything else
  • I did some extensive benchmarking around the # of bits of the element type and at what point radix sort was suboptimal, findings here
  • I split out a Radix struct that makes it easy to play around with different radix sizes and stuff, which was helpful in benchmarking
  • Lots of code comments to explain what everything does! It's a trickier algorithm w/ lots of indirection, so I found it helpful to keep track of what's going on.

For InlineStrings.jl, there are fixed-size string types from 1-byte all the way up to 255-bytes, so there were a lot of permutations of performance to check; here's one sample graph of the String7 type.

Anyway, excited to see the progress in this direction!
image

@LilithHafner
Copy link
Member Author

@quinnj's thresholds (64 bits -> 500 & 128 bits -> 2 million) line up okay with the thresholds I've chosen. I expect that the thresholds and heuristics in this PR could be tuned much more preciesly. If you would like to play around with thresholds, everything runs as is with Revise, I think it's just not building from source with the PR.

@quinnj, are there any changes you would recommend here based on your experience with InlineStrings?

@LilithHafner
Copy link
Member Author

LilithHafner commented Feb 20, 2022

Hooray! Everything but buildkite passed on 7b665ef

@LilithHafner
Copy link
Member Author

I'm giving up on understanding the ci errors and switching to blind bisection. I'm going to start by rebasing to the latest master and reverting everything.

@nalimilan
Copy link
Member

Great! Looks good to me too.

It could make sense to use the same threshold for switching to insertion sort (currently 40) and for checking issorted (currently 200). If the overhead of issorted is < 1% for 200 elements, I guess it's < 5% for 40 elements? The advantage could be that 1) the rules are simpler, and 2) we are always fast for pre-sorted inputs, since InsertionSort is fast for them.

Adding a check for reverse sorting also sounds like a good idea to me, as long as we use a high enough threshold to keep the overhead < 1%. I imagine this kind of pattern can happen in real life if you sort data in one direction and then in the other one, and making this up to 7× slower than it could would be too bad.

BTW, it would be nice to keep these benchmarks somewhere. Maybe you make a PR against BaseBenchmarks after this one is merged?

test/sorting.jl Outdated Show resolved Hide resolved
test/sorting.jl Outdated Show resolved Hide resolved
base/sort.jl Outdated Show resolved Hide resolved
base/sort.jl Outdated Show resolved Hide resolved
LilithHafner and others added 2 commits March 31, 2022 16:17
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@LilithHafner
Copy link
Member Author

The advantage could be that 1) the rules are simpler, and 2) we are always fast for pre-sorted inputs, since InsertionSort is fast for them.

For adjusting the minimum presorted input threshold, I find your reasoning quite convincing. I like how you provide a theoretical consideration (1) and then back it up with a practical example of why this matters (2: instead of three cases with a bad middle, there are only two cases).

For reverse sorted input, this makes the algorithm marginally more complicated, though the runtime cost-benefit is huge. This also makes a nice pattern where very short (< 40) reverse-sorted input is pathological for insertion sort, medium-length (40 < len < 500) is handled decently by radix sort, and long (> 500) is handled perfectly with a reverse sorted check. Intuitively, as length increases, special-case detection improves. I added it.

@LilithHafner
Copy link
Member Author

BTW, it would be nice to keep these benchmarks somewhere. Maybe you make a PR against BaseBenchmarks after this one is merged?

The benchmarks here are pretty similar to the benchmarks already in BaseBenchmarks/sort. The difference is that these span more input distributions, more types, more lengths, and longer lengths, and those at BaseBenchmarks cover sorting functions other than sort!. How do these tests cover a wider area? they take longer to run. These tests take a few hours to run. I might contribute a sparse cross product akin to what I've done in SortMark.jl, but it's not trivial to migrate these benchmarks to BaseBenchmarks without bringing along too much bloat.

@LilithHafner
Copy link
Member Author

Okay, this looks good to me too. Thanks so much, @nalimilan, @oscardssmith, and everyone else who has had conversations and helped along the way!

@oscardssmith
Copy link
Member

thanks so much for setting this through!

@nalimilan
Copy link
Member

Thanks @LilithHafner! Don't hesitate if you feel like extending radix sort to Union{T, Missing} vectors. :-)

@StefanKarpinski
Copy link
Member

Brava! Excellent work here.

Thought on the issorted check: we could introduce a sortcheck(v) utility function that returns 1 or 0 or -1 to indicate sorted, reverse sorted or neither. It would short-circuit, of course. I suspect that could be as fast as the issorted check.

@LilithHafner
Copy link
Member Author

Thanks!

I'm having trouble coming up with an implementation of sortcheck that is substantially faster than two issorted checks.

@StefanKarpinski
Copy link
Member

A little rough benchmarking seems to find that this can be substantially more efficient in some cases:

function sortcheck(v::Vector, lt=isless)
    cmp::UInt8 = 0b11
    if !isempty(v)
        y = v[1]
        @inbounds for i = 2:length(v)
            x = v[i]
            cmp &= 2*!lt(x, y) + !lt(y, x)
            iszero(cmp) && break
            y = x
        end
    end
    !iszero(cmp)
end

The worst case for issorted(v) || issorted(v, rev=true) is something like this:

julia> v = [fill(0, 1000); -1; 1];

julia> @btime issorted($v) || issorted($v, rev=true)
  990.154 ns (0 allocations: 0 bytes)
false

julia> @btime sortcheck($v)
  678.273 ns (0 allocations: 0 bytes)
false

In this case two issorted checks both have to scan through the entire array even though at the end of the first check it has already seen the writing on the wall (but not realized it). sortcheck does more work on each loop iteration but only scans the input once. It could probably be improved by eliminating that conditional break and allowing vectorization.

@StefanKarpinski
Copy link
Member

Sure enough, this slightly incorrect 8x unrolling of the inner loop is 2x faster:

function sortcheck(v::Vector, lt=isless)
    cmp::UInt8 = 0b11
    if !isempty(v)
        y = v[1]
        @inbounds for i = 2:8:length(v)
            for j = 0:7
                x = v[i+j]
                cmp &= 2*!lt(x, y) + !lt(y, x)
                y = x
            end
            iszero(cmp) && break
        end
    end
    !iszero(cmp)
end

If the end of the loop was handled correctly, this could work. Paying attention to memory alignment might improve the speed more. I do wish the compiler could figure that transformation out based on the fact that the inner loop has no side effects and once cmp becomes zero it can't come back, but I suppose that's asking a lot of LLVM.

@oscardssmith
Copy link
Member

The compiler problem is very similar to vectorizing findfirst automatically which would also be really nice.

@LilithHafner
Copy link
Member Author

I'm more concerned with runtime on entirely unsorted input than on input with long sorted prefixes. On random data, the latest sortcheck is fairly expensive:

@belapsed((sort!(x); sort(x; rev=true)), setup=(x=rand(40)))/2 #1.6455882352941177e-7
@belapsed issorted(x) setup=(x=rand(40)) #3.382e-9    2%
@belapsed sortcheck(x) setup=(x=rand(40)) #1.292778335005015e-8  8%

@StefanKarpinski
Copy link
Member

I was trying to make something faster than two issorted checks, which this is. But yes, it's still slower than just one issorted check, although only 6% of sort time on my system. Why are you sorting the data twice in your benchmark?

@LilithHafner
Copy link
Member Author

I'm sorting twice and dividing time by 2 because otherwise it would be repeatedly resorting sorted data, giving an unrealistically fast sorting time. On random data on my system, I'm measuring sortcheck running 2.6x slower than two issorted checks.

@belapsed((sort!(x); sort(x; rev=true)), setup=(x=rand(40)))/2 #1.684660633484163e-7
@belapsed issorted(x) setup=(x=rand(40)) #3.383e-9.                 2%
@belapsed issorted(x) || issorted(x, Reverse) setup=(x=rand(40)) #4.9079079079079075e-9  3%
@belapsed sortcheck(x) setup=(x=rand(40)) #1.2937813440320962e-8    8%
@belapsed sort!(x) setup=(x=rand(40)) #1.1508637873754152e-7        70%

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster potential benchmark Could make a good benchmark in BaseBenchmarks sorting Put things in order
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants