Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bitonic mergesort #62

Open
wants to merge 5 commits into
base: main
Choose a base branch
from
Open

Bitonic mergesort #62

wants to merge 5 commits into from

Conversation

nlw0
Copy link
Contributor

@nlw0 nlw0 commented Oct 11, 2022

Implements Batcher's bitonic mergesort. This algorithm effectively implements a sorting network, but can also be understood as a sorting algorithm.

Basing ourselves on the layout explained in this Wikipedia section:
https://en.wikipedia.org/wiki/Bitonic_sorter#Alternative_representation
it becomes simple to implement a version that works with inputs of arbitrary lengths. We just need to assume inf for all the complementary entries that would make the array size a power-of-two, and simply skip them in the comparisons.

The functions have been implemented forcing some variables to be Val, so that loop vectorizations are more likely to be triggered by the compiler. This seems challenging otherwise, especially for the first step with the backwards iteration.

No experiments have been made yet regarding speed, but it seems good vectorized code is being produced at least in some cases, and tests are already passing, so I thought I could go ahead and get some feedback already.

This PR was inspired by the previous discussion with @LilithHafner in #54.

Copy link
Member

@LilithHafner LilithHafner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neat! This algorithm is hot rigt now, but that doesn't mean that much. I'd still like to see a few benchmarks to demonstrate that there exist cases where it is the best algorithm available.

This implementation assumes one based indexing. Use firstindex, lastindex, and/or eachindex to make it agnostic to indexing and to make the @inbounds lookups not segfault on OffsetArrays.

This implementation also seems more complex than it needs to be, but I don't have a substantially simpler implementation off the top of my head.

@@ -619,4 +640,54 @@ function sort!(v::AbstractVector, lo::Int, hi::Int, ::CombSortAlg, o::Ordering)
return sort!(v, lo, hi, InsertionSort, o)
end

function sort!(v::AbstractVector, lo::Int, hi::Int, ::BitonicSortAlg, o::Ordering)
return bitonicsort!(view(v, lo:hi), o::Ordering)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using views in this context sometimes comes with a runtime performance penalty.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an alternative? This is only called once, btw, would it still be relevant?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The alternative used throughout SortingAlgorithms and Base.Sort is to carry around lo and hi everywhere. It is annoying but efficient. If you can get away with using views without performance penalty that would be great, but you'd need to benchmark both ways to find out.

Yes, it would still be relevant because all future indexing operations will be on a View rather than a Vector (or other input type)


function bitonicsort!(data, o::Ordering)
N = length(data)
for n in 1:ceil(Int, max(0, log2(N)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using leading_zeros would be much more efficient here for BitIntegers. idk how pronounced the impact on the whole would be.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only runs once at the start, but thanks, I didn't know about that function!

function bitonicsort!(data, o::Ordering)
N = length(data)
for n in 1:ceil(Int, max(0, log2(N)))
bitonicfirststage!(data, Val(n), o::Ordering)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is type-unstable because the type of Val(n) depends on something other than the input types of bitonicsort!. It is possible but unlikely that it is a good idea to include type instability in a good sorting algorithm.

I would start with type-stable code and only switch to type-unstable if you can find a convincing reason to do so (in the case of loop unrolling, that would require end to end benchmarks showing the unstable version is faster).

Type instabilities also have disadvantages other than runtime performance (e.g. precompilation efficacy & static analysis). In this case, the first time someone sorts a very large list, there will be new compilation for a new n.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we simply force N and n to be integers somehow? And what makes it unstable in the first place?

The idea is indeed to be forcing compilation with knowledge of the input size, because this seems necessary to trigger compiler optimizations. I saw a lot more vectorization in the machine code with that, and I believe it won't be really interesting unless we do this.

We might have a limit above which we don't use compile-time-known values, if that's an issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

N and n are already integers (length is documented to return an Integer), but that is not the issue. What makes Val(n) unstable is that typeof(Val(4)) == Val{4} != typeof(Val(3)). The compiler only knows about types, not values, so it knows the type of Val(n::Int) is Val{???} but it does not know what value n holds, so it doesn't know the concrete type. This is in contrast with Set(n::Int) which is of type Set{Int} where the compiler can deduce the type of the result from the types of the inputs.

Forcing separate compilation for each step size is possible but should not be necessary to get SIMD. I originally took this approach in implementing radix sort in julia base to compile separately for every chunck size, but eventually realized that there wasn't any noticeable benefit from doing that and removed the optimization as it wasn't helpful. This is the commit where I removed it: JuliaLang/julia@17f45e8. The parent commit contains an example that avoids dynamic dispatch for small values by using explicit if statements. It is ugly IMO and I would only support it if it gave substantial speedups.

While it is possible to glean vlauble information from @code_llvm and @code_native outputs, it takes a great deal of understanding of CPU architectures to make accurate value judgements from them. I highly recommend benchmarking with @time (or BenchmarkTools' @btime) and profiling with Profile's @profile (or VSCode's @profview). Because those tools can quickly and easily point to potential problems and detect improvements or regressions.

For example, @profview for _ in 1:1000 sort!(rand(Int, 80); alg=BitonicSort) end reveals that 95% of sorting time is spent in the bitonicsecondstage! function and less than 1% is spent in the bitonicfirststage! function. This indicates substantial a performance problem in the bitonicsecondstage! function.

@profview for _ in 1:200000 sort!(rand(Int, 15); alg=BitonicSort) end indicates that about 65% the sorting time is spent in the Val function and in runtime dispatch, indicating a problem there, too.

src/SortingAlgorithms.jl Outdated Show resolved Hide resolved
Comment on lines 668 to 669
a, b = v[ia + 1], v[ib + 1]
v[ia+1], v[ib+1] = lt(o, b, a) ? (b, a) : (a, b)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pattern appears thrice now (including once in combsort). It is probably a good idea to factor it out into a swap function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. A better name might be perhaps comparator.

I think in previous versions of the code it was worth it to read the value before hand, and leave the re-assignment for later, but thankfully it's all really concentrated now, so we should do this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A comparator is a function that takes two (or sometimes three or more) inputs and determines which is larger. lt(order, _, _) and isless are comparators. This function also conditionally swaps the elements. There may be better names for it than swap, but I don't like comparator.

https://en.wikipedia.org/wiki/Comparator
https://clojure.org/guides/comparators
https://docs.oracle.com/javase/8/docs/api/java/util/Comparator.html

gap = 1 << n
for i in 0:gap:N-1
lastj = min(N - 1, N - (i + gap >> 1 + 1))
for j in 0:lastj
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When n = 1, will this loop run Θ(N^2) times? IIUC, n will equal 1 Θ(log2(N)^2) times which would give this algorithm a runtime of Ω(log(N)^2*N^2). That seems bad.

Perhaps the bounds of this loop should be tighter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still trying to understand it all, because in the sorting network literature it's all about the delay, or depth... On the wikipedia page, though, it clearly states the network has n log(n)^2 comparators, so I guess that should be the complexity. Not n log(n) but not too shabby, and the main attraction is in parallelism anyways.

By eyeballing the diagram, or unrolling these two stages, we have this:

first_stage(data, 1)

first_stage(data, 2)
second_stage(data, 1)

first_stage(data, 3)
second_stage(data, 2)
second_stage(data, 1)

first_stage(data, 4)
second_stage(data, 3)
second_stage(data, 2)
second_stage(data, 1)

So it's log(n) number of these larger blocks, but they increase linearly, making it O(log(n)^2) function calls. Each of these are linear, so O(n log(n)^2). Do you agree? Notice it's linear because although it's two for-loops, the intervals are such that it's actually a linear traversal. We might even reshape the input to implement this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that that is what the runtime should be, but I think there is an error in the implementation that results in much larger runtimes.

Copy link
Contributor Author

@nlw0 nlw0 Oct 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, you're right! There was an extra outer for-loop for the second stage... Hopefully that helps a bit, but I'm still skeptical this is as good as it gets. I mean, I'm not even sure it will really be worth it in the end, but I'm not sure we're at the best for this algorithm yet.

src/SortingAlgorithms.jl Outdated Show resolved Hide resolved
This algorithm performs a series of pre-determined comparisons, and tends to be very parallelizable.
The algorithm effectively implements a sorting network based on merging bitonic sequences.

Characteristics:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to have some characteristics that help people decide whether this algorithm is appropriate for their use case (e.g. when could it be better than the default sorting algorithms)

it performs many independent comparisons.

## References
- Batcher, K.E., (1968). "Sorting networks and their applications", AFIPS '68, doi: https://doi.org/10.1145/1468075.1468121.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This source claims on page 1 that it is possible to achieve a sorting network with depth (log2 n)*((log2 n)+1)/2 and comparisons n(log2 n)^2/4. Does this implementation achieve those values?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so, following the reasoning for the complexity I gave previously.

- *parallelizable* suitable for vectorization with SIMD instructions because
it performs many independent comparisons.

## References
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://en.wikipedia.org/wiki/Bitonic_sorter is a great reference for this algorithm.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really a big fan of bloating the References sections, or citing Wikipedia, but fine by me if you think so.

@LilithHafner
Copy link
Member

fwiw, here's a first draft of an implementation that uses bit-twiddling to have simpler control flow (i.e. fewer and longer inner loops):

using Base: Order

function bitonic_merge_sort!(v::AbstractVector, o::Ordering=Forward)
    n = Int(length(v))
    fi = firstindex(v)
    for k in 0:8sizeof(n-1)-leading_zeros(n-1)-1
        lo_mask = 1<<k-1
        i = 0
        while true
            lo_bits = i & lo_mask
            hi_bits = (i & ~lo_mask) << 1
            hi = hi_bits + lo_bits + lo_mask + 1
            hi >= n && break
            lo = hi_bits + lo_bits  lo_mask
            @inbounds swap!(v, lo+fi, hi+fi, o)
            i += 1
        end
        for j in k-1:-1:0
            lo_mask = 1<<j-1
            i = 0
            while true
                lo = i & lo_mask + (i & ~lo_mask) << 1
                hi = lo + lo_mask + 1
                hi >= n && break
                @inbounds swap!(v, lo+fi, hi+fi, o)
                i += 1
            end
        end
    end
    v
end

Base.@propagate_inbounds function swap!(v, i, j, o)
    a, b = v[i], v[j]
    v[i], v[j] = lt(o, b, a) ? (b, a) : (a, b)
end

for _ in 1:1000
    issorted(bitonic_merge_sort!(OffsetVector(rand(1:10,rand(1:10)),rand(-10:10)))) || error()
    issorted(bitonic_merge_sort!(OffsetVector(rand(1:10,rand(1:10)),rand(-10:10)), Reverse), rev=true) || error()
end
julia> @btime sort!(x) setup=(x=rand(Int, 80)) evals=1;
  1.543 μs (0 allocations: 0 bytes)

julia> @btime bitonic_merge_sort!(x) setup=(x=rand(Int, 80)) evals=1;
  1.752 μs (0 allocations: 0 bytes)

julia> @btime sort!(x; alg=BitonicSort) setup=(x=rand(Int, 80)) evals=1;
  73.077 μs (28 allocations: 1.31 KiB)

I have yet to find a case where it outperforms the default algorithm (though I happen to be running on a branch which has pretty good default algorithms)

@nlw0
Copy link
Contributor Author

nlw0 commented Oct 12, 2022

@LilithHafner that's great! I think I saw something similar in examples out there, but I didn't really understand until now... Have you checked if the loop vectorizations from the compiler are kicking in, though? How does it compare to the original code? Also I would expect that at least for very short lists it should perform great...

@LilithHafner
Copy link
Member

Have you checked if the loop vectorizations from the compiler are kicking in, though?

I don't see any simd instructions in @code_llvm bitonic_merge_sort!([1,2,3], Forward).

How does it compare to the original code?

About 50x faster than the PR code according to the benchmarks I posted above. Results ae similar for longer vectors but the default algorithms have asymttoic runtime of O(n log n) and O(n), so bitonic merge sort with O(n (log n)^2) runtime will probably only shine for small ish inputs. The benchmarks ae incredibly unfair, though, because the PR still has some major performance issues that can probably be fairly easily resolved (#62 (comment) & #62 (comment)).

@codecov-commenter
Copy link

codecov-commenter commented Oct 13, 2022

Codecov Report

Merging #62 (d4c23d9) into master (80c14f5) will increase coverage by 0.33%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master      #62      +/-   ##
==========================================
+ Coverage   96.51%   96.85%   +0.33%     
==========================================
  Files           1        1              
  Lines         344      381      +37     
==========================================
+ Hits          332      369      +37     
  Misses         12       12              
Impacted Files Coverage Δ
src/SortingAlgorithms.jl 96.85% <100.00%> (+0.33%) ⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@nlw0
Copy link
Contributor Author

nlw0 commented Oct 13, 2022

I made some of the changes, hopefully I can look at your implementation later today in more detail.

@nlw0
Copy link
Contributor Author

nlw0 commented Oct 13, 2022 via email

@nlw0
Copy link
Contributor Author

nlw0 commented Oct 13, 2022

I've tried some benchmarking, with nasty results. I guess similar to what you observed. I find it strange, though, that I couldn't get reasonable times even for small inputs. I think I'll try something like generator functions or "hand-crafted" implementations with simd just to see if there's any potential and find a clue for what's up going on...

nlw0 and others added 2 commits October 14, 2022 08:11
Co-authored-by: Lilith Orion Hafner <lilithhafner@gmail.com>
@LilithHafner
Copy link
Member

@nlw0
Copy link
Contributor Author

nlw0 commented Oct 25, 2022

Nice! I don't think I had seen that before. I guess generated functions are really the way to go.

In the comments they say only inputs up to a certain size are accepted because of the sub-optimal complexity, but I'd argue that because of parallelism it's hard to know for sure where is that point, what is also supported by the good performance of combsort. Maybe we could consider to keep working on a more generic version to see how it goes. We probably need generated functions to get good machine code, though.

I wonder if there's in the literature already some kind of modification to combsort to make it look more like bitonic merge sort. The main difference I see, apart from the fact the sub-list sizes are all in powers of two, is that the intervals grow first, then shrink, then grow again. If we could prove that any of these things puts a hard limit on the maximum distance of an entry to its final position, that would be it, I think.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants