Bitonic mergesort #62
base: main
Conversation
Neat! This algorithm is hot right now, but that doesn't mean that much. I'd still like to see a few benchmarks demonstrating that there exist cases where it is the best algorithm available.
This implementation assumes one-based indexing. Use `firstindex`, `lastindex`, and/or `eachindex` to make it agnostic to indexing and to make the `@inbounds` lookups not segfault on `OffsetArray`s.
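For illustration, a sketch of the index-agnostic pattern (`ia` and `ib` stand for zero-based offsets like the ones used later in this PR):

fi = firstindex(v)
a, b = v[fi + ia], v[fi + ib]  # instead of v[ia + 1], v[ib + 1]; valid for any axes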
This implementation also seems more complex than it needs to be, but I don't have a substantially simpler implementation off the top of my head.
@@ -619,4 +640,54 @@ function sort!(v::AbstractVector, lo::Int, hi::Int, ::CombSortAlg, o::Ordering)
    return sort!(v, lo, hi, InsertionSort, o)
end

function sort!(v::AbstractVector, lo::Int, hi::Int, ::BitonicSortAlg, o::Ordering)
    return bitonicsort!(view(v, lo:hi), o::Ordering)
Using views in this context sometimes comes with a runtime performance penalty.
Is there an alternative? This is only called once, btw, would it still be relevant?
The alternative used throughout SortingAlgorithms and Base.Sort is to carry around lo and hi everywhere. It is annoying but efficient. If you can get away with using views without performance penalty that would be great, but you'd need to benchmark both ways to find out.
Yes, it would still be relevant because all future indexing operations will be on a `View` rather than a `Vector` (or other input type).
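To illustrate (a sketch, not from the PR): the view is created only once, but every subsequent `v[i]` inside the sort goes through the `SubArray` wrapper:

w = view(rand(100), 10:90)
typeof(w)  # SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}
w[1]       # each access re-maps the index through the parent array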
src/SortingAlgorithms.jl (Outdated)
function bitonicsort!(data, o::Ordering)
    N = length(data)
    for n in 1:ceil(Int, max(0, log2(N)))
Using `leading_zeros` would be much more efficient here for `BitInteger`s. I don't know how pronounced the impact on the whole would be.
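For reference, a bit-twiddling equivalent would be something like this sketch (assuming `N::Int` with `N ≥ 1`; `nstages` is a hypothetical name):

nstages(N::Int) = 8sizeof(N) - leading_zeros(N - 1)  # == ceil(Int, log2(N)), no floating point
nstages(1), nstages(2), nstages(5), nstages(8)       # (0, 1, 3, 3)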
This only runs once at the start, but thanks, I didn't know about that function!
function bitonicsort!(data, o::Ordering)
    N = length(data)
    for n in 1:ceil(Int, max(0, log2(N)))
        bitonicfirststage!(data, Val(n), o::Ordering)
This is type-unstable because the type of `Val(n)` depends on something other than the input types of `bitonicsort!`. It is possible but unlikely that it is a good idea to include type instability in a good sorting algorithm.
I would start with type-stable code and only switch to type-unstable if you can find a convincing reason to do so (in the case of loop unrolling, that would require end-to-end benchmarks showing the unstable version is faster).
Type instabilities also have disadvantages other than runtime performance (e.g. precompilation efficacy & static analysis). In this case, the first time someone sorts a very large list, there will be new compilation for a new `n`.
Can we simply force N and n to be integers somehow? And what makes it unstable in the first place?
The idea is indeed to force compilation with knowledge of the input size, because this seems necessary to trigger compiler optimizations. I saw a lot more vectorization in the machine code with that, and I believe it won't be really interesting unless we do this.
We might have a limit above which we don't use compile-time-known values, if that's an issue.
`N` and `n` are already integers (`length` is documented to return an `Integer`), but that is not the issue. What makes `Val(n)` unstable is that `typeof(Val(4)) == Val{4} != typeof(Val(3))`. The compiler only knows about types, not values, so it knows the type of `Val(n::Int)` is `Val{???}` but it does not know what value `n` holds, so it doesn't know the concrete type. This is in contrast with `Set(n::Int)`, which is of type `Set{Int}`, where the compiler can deduce the type of the result from the types of the inputs.
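A minimal demonstration of the difference (hypothetical function names):

stable(n::Int)   = Set(n)  # result type Set{Int} is deducible from the input type alone
unstable(n::Int) = Val(n)  # concrete Val{n} depends on the *value* of n

Base.return_types(stable, (Int,))    # [Set{Int}] -- concrete, inferable
Base.return_types(unstable, (Int,))  # [Val]      -- abstract, i.e. type-unstable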
Forcing separate compilation for each step size is possible but should not be necessary to get SIMD. I originally took this approach in implementing radix sort in Julia Base, compiling separately for every chunk size, but eventually realized that there wasn't any noticeable benefit from doing that and removed the optimization as it wasn't helpful. This is the commit where I removed it: JuliaLang/julia@17f45e8. The parent commit contains an example that avoids dynamic dispatch for small values by using explicit if statements. It is ugly IMO and I would only support it if it gave substantial speedups.
While it is possible to glean valuable information from `@code_llvm` and `@code_native` outputs, it takes a great deal of understanding of CPU architectures to make accurate value judgements from them. I highly recommend benchmarking with `@time` (or BenchmarkTools' `@btime`) and profiling with Profile's `@profile` (or VSCode's `@profview`), because those tools can quickly and easily point to potential problems and detect improvements or regressions.
For example, `@profview for _ in 1:1000 sort!(rand(Int, 80); alg=BitonicSort) end` reveals that 95% of sorting time is spent in the `bitonicsecondstage!` function and less than 1% is spent in the `bitonicfirststage!` function. This indicates a substantial performance problem in `bitonicsecondstage!`.
`@profview for _ in 1:200000 sort!(rand(Int, 15); alg=BitonicSort) end` indicates that about 65% of the sorting time is spent in the `Val` function and in runtime dispatch, indicating a problem there, too.
src/SortingAlgorithms.jl (Outdated)
a, b = v[ia + 1], v[ib + 1]
v[ia+1], v[ib+1] = lt(o, b, a) ? (b, a) : (a, b)
This pattern appears thrice now (including once in combsort). It is probably a good idea to factor it out into a swap function.
Sure. A better name might perhaps be `comparator`.
I think in previous versions of the code it was worth it to read the values beforehand and leave the re-assignment for later, but thankfully it's all really concentrated now, so we should do this.
A comparator is a function that takes two (or sometimes three or more) inputs and determines which is larger. `lt(order, _, _)` and `isless` are comparators. This function also conditionally swaps the elements. There may be better names for it than `swap`, but I don't like `comparator`.
https://en.wikipedia.org/wiki/Comparator
https://clojure.org/guides/comparators
https://docs.oracle.com/javase/8/docs/api/java/util/Comparator.html
src/SortingAlgorithms.jl (Outdated)
gap = 1 << n
for i in 0:gap:N-1
    lastj = min(N - 1, N - (i + gap >> 1 + 1))
    for j in 0:lastj
When `n = 1`, will this loop run Θ(N^2) times? IIUC, `n` will equal 1 Θ(log2(N)^2) times, which would give this algorithm a runtime of Ω(log(N)^2*N^2). That seems bad.
Perhaps the bounds of this loop should be tighter.
I'm still trying to understand it all, because in the sorting network literature it's all about the delay, or depth... On the Wikipedia page, though, it clearly states the network has n log(n)^2 comparators, so I guess that should be the complexity. Not n log(n), but not too shabby, and the main attraction is in parallelism anyway.
By eyeballing the diagram, or unrolling these two stages, we have this:
first_stage(data, 1)
first_stage(data, 2)
second_stage(data, 1)
first_stage(data, 3)
second_stage(data, 2)
second_stage(data, 1)
first_stage(data, 4)
second_stage(data, 3)
second_stage(data, 2)
second_stage(data, 1)
So it's a log(n) number of these larger blocks, but they increase linearly, making it O(log(n)^2) function calls. Each of these is linear, so O(n log(n)^2). Do you agree? Notice it's linear because although it's two for-loops, the intervals are such that it's actually a linear traversal. We might even `reshape` the input to implement this.
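In code, that schedule would be something like this sketch (calling the PR's stage functions with a plain-integer level, setting aside the `Val` question above):

function bitonicsort!(data, o::Ordering)
    N = length(data)
    for n in 1:ceil(Int, max(0, log2(N)))
        bitonicfirststage!(data, n, o)       # one growing first-stage block...
        for m in n-1:-1:1
            bitonicsecondstage!(data, m, o)  # ...followed by shrinking second stages
        end
    end
    return data
end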
I agree that that is what the runtime should be, but I think there is an error in the implementation that results in much larger runtimes.
Oops, you're right! There was an extra outer for-loop for the second stage... Hopefully that helps a bit, but I'm still skeptical this is as good as it gets. I mean, I'm not even sure it will really be worth it in the end, but I'm not sure we're at the best for this algorithm yet.
This algorithm performs a series of pre-determined comparisons, and tends to be very parallelizable.
The algorithm effectively implements a sorting network based on merging bitonic sequences.

Characteristics:
It would be nice to have some characteristics that help people decide whether this algorithm is appropriate for their use case (e.g. when it could be better than the default sorting algorithms).
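For example, a draft along these lines (my wording, not the PR's):

Characteristics:
- *not stable*: does not preserve the order of elements that compare equal.
- *in-place*: requires no auxiliary memory.
- *O(n log(n)^2) comparisons*: asymptotically more than the default O(n log n) algorithms, so mainly attractive for small inputs or hardware with wide SIMD/parallel execution.
- *data-independent*: the comparison pattern does not depend on the input, which is what enables vectorization and parallelism.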
src/SortingAlgorithms.jl (Outdated)

  it performs many independent comparisons.

## References
- Batcher, K.E., (1968). "Sorting networks and their applications", AFIPS '68, doi: https://doi.org/10.1145/1468075.1468121.
This source claims on page 1 that it is possible to achieve a sorting network with depth (log2 n)*((log2 n)+1)/2 and n*(log2 n)^2/4 comparisons. Does this implementation achieve those values?
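For concreteness, at n = 16 those formulas give (plain arithmetic):

n = 16; k = log2(n)    # k = 4.0
depth = k*(k + 1)/2    # 10 layers of parallel compare-exchanges
comparisons = n*k^2/4  # 64 compare-exchanges in total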
I think so, following the reasoning for the complexity I gave previously.
src/SortingAlgorithms.jl (Outdated)

- *parallelizable* suitable for vectorization with SIMD instructions because
  it performs many independent comparisons.

## References
https://en.wikipedia.org/wiki/Bitonic_sorter is a great reference for this algorithm.
I'm not really a big fan of bloating the References section, or of citing Wikipedia, but fine by me if you think so.
fwiw, here's a first draft of an implementation that uses bit-twiddling to have simpler control flow (i.e. fewer and longer inner loops):

using Base.Order                  # for Ordering, Forward, Reverse, lt
using OffsetArrays: OffsetVector  # only needed for the correctness check below

function bitonic_merge_sort!(v::AbstractVector, o::Ordering=Forward)
    n = Int(length(v))
    fi = firstindex(v)
    for k in 0:8sizeof(n-1)-leading_zeros(n-1)-1  # one merge phase per bit of n-1
        lo_mask = 1<<k-1
        # First step of the phase: compare mirrored pairs within each block of 2^(k+1).
        i = 0
        while true
            lo_bits = i & lo_mask
            hi_bits = (i & ~lo_mask) << 1
            hi = hi_bits + lo_bits + lo_mask + 1
            hi >= n && break  # skip comparisons against the virtual padding past the end
            lo = hi_bits + lo_bits ⊻ lo_mask
            @inbounds swap!(v, lo+fi, hi+fi, o)
            i += 1
        end
        # Remaining steps: compare pairs a fixed distance 2^j apart (half-cleaners).
        for j in k-1:-1:0
            lo_mask = 1<<j-1
            i = 0
            while true
                lo = i & lo_mask + (i & ~lo_mask) << 1
                hi = lo + lo_mask + 1
                hi >= n && break
                @inbounds swap!(v, lo+fi, hi+fi, o)
                i += 1
            end
        end
    end
    v
end

# Compare-exchange: put v[i] and v[j] in order according to o.
Base.@propagate_inbounds function swap!(v, i, j, o)
    a, b = v[i], v[j]
    v[i], v[j] = lt(o, b, a) ? (b, a) : (a, b)
end

for _ in 1:1000
    issorted(bitonic_merge_sort!(OffsetVector(rand(1:10,rand(1:10)),rand(-10:10)))) || error()
    issorted(bitonic_merge_sort!(OffsetVector(rand(1:10,rand(1:10)),rand(-10:10)), Reverse), rev=true) || error()
end

julia> @btime sort!(x) setup=(x=rand(Int, 80)) evals=1;
  1.543 μs (0 allocations: 0 bytes)

julia> @btime bitonic_merge_sort!(x) setup=(x=rand(Int, 80)) evals=1;
  1.752 μs (0 allocations: 0 bytes)

julia> @btime sort!(x; alg=BitonicSort) setup=(x=rand(Int, 80)) evals=1;
  73.077 μs (28 allocations: 1.31 KiB)

I have yet to find a case where it outperforms the default algorithm (though I happen to be running on a branch which has pretty good default algorithms).
@LilithHafner that's great! I think I saw something similar in examples out there, but I didn't really understand until now... Have you checked whether the loop vectorizations from the compiler are kicking in, though? How does it compare to the original code? Also, I would expect that at least for very short lists it should perform great...
I don't see any SIMD instructions in the generated code.
About 50x faster than the PR code according to the benchmarks I posted above. Results are similar for longer vectors, but the default algorithms have asymptotic runtimes of O(n log n) and O(n), so bitonic merge sort with O(n (log n)^2) runtime will probably only shine for smallish inputs. The benchmarks are incredibly unfair, though, because the PR still has some major performance issues that can probably be fairly easily resolved (#62 (comment) & #62 (comment)).
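For a rough sense of that asymptotic gap (plain arithmetic, not a benchmark): dividing Batcher's n*(log2 n)^2/4 comparison count by the n*log2 n of a comparison sort leaves a factor of log2(n)/4:

for n in (2^4, 2^8, 2^16)
    println(n, " => ", log2(n)/4)  # 16 => 1.0, 256 => 2.0, 65536 => 4.0
end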
Codecov Report
@@ Coverage Diff @@
## master #62 +/- ##
==========================================
+ Coverage 96.51% 96.85% +0.33%
==========================================
Files 1 1
Lines 344 381 +37
==========================================
+ Hits 332 369 +37
Misses 12 12
I made some of the changes; hopefully I can look at your implementation later today in more detail.
I agree, but I'm not sure what to call it. "comparator" is the term used in the sorting networks literature, that's why I was considering it. Ideally it should be a verb, but "compare" also suggests a predicate. Maybe `sort2!(a, b, o)`?
I've tried some benchmarking, with nasty results. I guess similar to what you observed. I find it strange, though, that I couldn't get reasonable times even for small inputs. I think I'll try something like generated functions or "hand-crafted" implementations with SIMD just to see if there's any potential and find a clue for what's going on...
Here's an implementation in Julia: https://github.com/JuliaArrays/StaticArrays.jl/blob/v1.5.9/src/sort.jl#L50-L89
Nice! I don't think I had seen that before. I guess generated functions are really the way to go. In the comments they say only inputs up to a certain size are accepted because of the sub-optimal complexity, but I'd argue that because of parallelism it's hard to know for sure where that point is, which is also supported by the good performance of combsort. Maybe we could keep working on a more generic version to see how it goes. We probably need generated functions to get good machine code, though. I wonder if the literature already has some modification to combsort that makes it look more like bitonic merge sort. The main difference I see, apart from the fact that the sub-list sizes are all powers of two, is that the intervals grow first, then shrink, then grow again. If we could prove that any of these things puts a hard limit on the maximum distance of an entry to its final position, that would be it, I think.
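As a toy illustration of the generated-function idea (not the StaticArrays code; it reuses the `swap!` helper from above, a textbook 5-comparator network for 4 inputs, and assumes 1-based indexing):

@generated function sort4!(v::AbstractVector, o::Base.Order.Ordering)
    network = [(1, 2), (3, 4), (1, 3), (2, 4), (2, 3)]  # schedule fixed at compile time
    body = [:(swap!(v, $i, $j, o)) for (i, j) in network]
    quote
        $(body...)  # fully unrolled: no loops or branches left at runtime
        return v
    end
end

sort4!([3, 1, 4, 2], Base.Order.Forward)  # [1, 2, 3, 4]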
Implements Batcher's bitonic mergesort. This algorithm effectively implements a sorting network, but can also be understood as a sorting algorithm.
Basing ourselves on the layout explained in this Wikipedia section:
https://en.wikipedia.org/wiki/Bitonic_sorter#Alternative_representation
it becomes simple to implement a version that works with inputs of arbitrary lengths. We just need to assume `inf` for all the complementary entries that would make the array size a power of two, and simply skip them in the comparisons.

The functions have been implemented forcing some variables to be `Val`, so that loop vectorizations are more likely to be triggered by the compiler. This seems challenging otherwise, especially for the first step with the backwards iteration.

No experiments have been made yet regarding speed, but it seems good vectorized code is being produced at least in some cases, and tests are already passing, so I thought I could go ahead and get some feedback already.
This PR was inspired by the previous discussion with @LilithHafner in #54.