
Bitonic mergesort #62

Open · wants to merge 5 commits into main
Changes from 3 commits
84 changes: 80 additions & 4 deletions src/SortingAlgorithms.jl
@@ -9,12 +9,13 @@ using Base.Order
import Base.Sort: sort!
import DataStructures: heapify!, percolate_down!

-export HeapSort, TimSort, RadixSort, CombSort
+export HeapSort, TimSort, RadixSort, CombSort, BitonicSort

struct HeapSortAlg <: Algorithm end
struct TimSortAlg <: Algorithm end
struct RadixSortAlg <: Algorithm end
struct CombSortAlg <: Algorithm end
struct BitonicSortAlg <: Algorithm end

const HeapSort = HeapSortAlg()
const TimSort = TimSortAlg()
@@ -46,6 +47,27 @@ Characteristics:
"""
const CombSort = CombSortAlg()

"""
    BitonicSort

Indicates that a sorting function should use the bitonic mergesort algorithm.
This algorithm performs a predetermined sequence of comparisons, which makes it highly parallelizable.
It effectively implements a sorting network based on merging bitonic sequences.

Characteristics:
[Member] It would be nice to have some characteristics that help people decide whether this algorithm is appropriate for their use case (e.g. when it could be better than the default sorting algorithms).

- *not stable* does not preserve the ordering of elements which
compare equal (e.g. "a" and "A" in a sort of letters which
ignores case).
- *in-place* in memory.
- *parallelizable* suitable for vectorization with SIMD instructions because
it performs many independent comparisons.

## References
[Member] https://en.wikipedia.org/wiki/Bitonic_sorter is a great reference for this algorithm.

[Contributor Author] I'm not really a big fan of bloating the References section, or of citing Wikipedia, but fine by me if you think so.

- Batcher, K.E., "Sorting networks and their applications", AFIPS '68 (1968), doi: https://doi.org/10.1145/1468075.1468121.
- "Bitonic Sorter", in "Wikipedia" (Oct 2022). https://en.wikipedia.org/wiki/Bitonic_sorter
[nlw0 marked this conversation as resolved.]
"""
const BitonicSort = BitonicSortAlg()
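
A quick usage sketch (an editorial addition; it relies only on the keyword API exercised in the tests below):

    using SortingAlgorithms
    v = rand(Int, 1024)
    sort!(v; alg=BitonicSort)   # in-place sort through the bitonic network
    @assert issorted(v)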


## Heap sort

@@ -609,14 +631,68 @@ function sort!(v::AbstractVector, lo::Int, hi::Int, ::CombSortAlg, o::Ordering)
    interval = (3 * (hi-lo+1)) >> 2

    while interval > 1
-        @inbounds for j in lo:hi-interval
-            a, b = v[j], v[j+interval]
-            v[j], v[j+interval] = lt(o, b, a) ? (b, a) : (a, b)
+        for j in lo:hi-interval
+            @inbounds comparator!(v, j, j+interval, o)
        end
        interval = (3 * interval) >> 2
    end

    return sort!(v, lo, hi, InsertionSort, o)
end

function sort!(v::AbstractVector, lo::Int, hi::Int, ::BitonicSortAlg, o::Ordering)
    return bitonicsort!(view(v, lo:hi), o::Ordering)
[Member] Using views in this context sometimes comes with a runtime performance penalty.

[Contributor Author] Is there an alternative? This is only called once, by the way — would it still be relevant?

[Member] The alternative used throughout SortingAlgorithms and Base.Sort is to carry lo and hi around everywhere. It is annoying but efficient. If you can get away with using views without a performance penalty, that would be great, but you'd need to benchmark both ways to find out.

Yes, it would still be relevant, because all subsequent indexing operations will be on a view rather than a Vector (or other input type). (The two conventions are sketched after this function.)

end
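
For illustration (an editorial sketch, not part of the PR), the two conventions discussed above:

    # (a) the PR's approach: normalize the range once with a view; all helpers
    #     then index from 1 to the length of the SubArray
    sort_via_view!(v, lo, hi, o) = bitonicsort!(view(v, lo:hi), o)

    # (b) the style used elsewhere in SortingAlgorithms and Base.Sort: thread
    #     lo and hi through every helper (this bitonicsort! method is hypothetical)
    sort_via_bounds!(v, lo, hi, o) = bitonicsort!(v, lo, hi, o)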

function bitonicsort!(data, o::Ordering)
    N = length(data)
    for n in 1:intlog2(N)
        bitonicfirststage!(data, Val(n), o::Ordering)
[Member] This is type-unstable because the type of Val(n) depends on something other than the input types of bitonicsort!. It is possible but unlikely that it is a good idea to include type instability in a good sorting algorithm.

I would start with type-stable code and only switch to type-unstable code if you can find a convincing reason to do so (in the case of loop unrolling, that would require end-to-end benchmarks showing the unstable version is faster).

Type instabilities also have disadvantages other than runtime performance (e.g. precompilation efficacy and static analysis). In this case, the first time someone sorts a very large list, there will be new compilation for a new n.

[Contributor Author] Can we simply force N and n to be integers somehow? And what makes it unstable in the first place?

The idea is indeed to force compilation with knowledge of the input size, because this seems necessary to trigger compiler optimizations. I saw a lot more vectorization in the machine code with that, and I believe it won't be really interesting unless we do this.

We might have a limit above which we don't use compile-time-known values, if that's an issue.

[Member] N and n are already integers (length is documented to return an Integer), but that is not the issue. What makes Val(n) unstable is that typeof(Val(4)) == Val{4} != typeof(Val(3)). The compiler only knows about types, not values, so it knows the type of Val(n::Int) is Val{???}, but it does not know what value n holds, so it doesn't know the concrete type. This is in contrast with Set(n::Int), which is of type Set{Int}: there the compiler can deduce the type of the result from the types of the inputs.

Forcing separate compilation for each step size is possible but should not be necessary to get SIMD. I originally took this approach when implementing radix sort in Julia Base, compiling separately for every chunk size, but eventually realized that there wasn't any noticeable benefit from doing so and removed the optimization. This is the commit where I removed it: JuliaLang/julia@17f45e8. The parent commit contains an example that avoids dynamic dispatch for small values by using explicit if statements. It is ugly IMO and I would only support it if it gave substantial speedups.

While it is possible to glean valuable information from @code_llvm and @code_native outputs, it takes a great deal of understanding of CPU architectures to make accurate value judgements from them. I highly recommend benchmarking with @time (or BenchmarkTools' @btime) and profiling with Profile's @profile (or VSCode's @profview), because those tools can quickly and easily point to potential problems and detect improvements or regressions.

For example, @profview for _ in 1:1000 sort!(rand(Int, 80); alg=BitonicSort) end reveals that 95% of sorting time is spent in the bitonicsecondstage! function and less than 1% in the bitonicfirststage! function. This indicates a substantial performance problem in bitonicsecondstage!.

@profview for _ in 1:200000 sort!(rand(Int, 15); alg=BitonicSort) end indicates that about 65% of the sorting time is spent in the Val function and in runtime dispatch, indicating a problem there, too. (A type-stable sketch follows the function below.)

        for m in n-1:-1:1
            bitonicsecondstage!(data, Val(m), o::Ordering)
        end
    end
    return data
end
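
For reference, a type-stable variant along the lines the reviewer suggests could look like this (an editorial sketch; the Int-taking helper methods are hypothetical, not part of this PR):

    function bitonicsort_stable!(data, o::Ordering)
        N = length(data)
        for n in 1:intlog2(N)
            bitonicfirststage!(data, n, o)      # hypothetical method taking the gap exponent as a plain Int
            for m in n-1:-1:1
                bitonicsecondstage!(data, m, o) # likewise: no Val, hence no runtime dispatch
            end
        end
        return data
    end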

function bitonicfirststage!(v, ::Val{Gap}, o::Ordering) where Gap
    N = length(v)
    @inbounds begin
        gap = 1 << Gap
        g2 = gap >> 1
        for i in 0:gap:N-1
            firstj = max(0, i + gap - N)
            for j in firstj:(g2-1)
                ia = i + j
                ib = i + gap - j - 1
                @inbounds comparator!(v, ia + 1, ib + 1, o)
            end
        end
    end
end
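
An editorial illustration of the pattern above: for N = 8 and Gap = 3 (so gap = 8), the loops compare the mirrored 0-based pairs of a bitonic merge's first stage.

    # Pairs generated by the loops above for N = 8, Gap = 3:
    [(i + j, i + 8 - j - 1) for i in 0:8:7 for j in 0:3]
    # => [(0, 7), (1, 6), (2, 5), (3, 4)]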

function bitonicsecondstage!(v, ::Val{Gap}, o::Ordering) where Gap
    N = length(v)
    @inbounds begin
        for n in Gap:-1:1
            gap = 1 << n
            for i in 0:gap:N-1
                lastj = min(N - 1, N - (i + gap >> 1 + 1))
                for j in 0:lastj
[Member] When n = 1, will this loop run Θ(N^2) times? IIUC, n will equal 1 Θ(log2(N)^2) times, which would give this algorithm a runtime of Ω(log(N)^2 * N^2). That seems bad.

Perhaps the bounds of this loop should be tighter.

[Contributor Author] I'm still trying to understand it all, because the sorting-network literature is all about the delay, or depth... The Wikipedia page, though, clearly states the network has n·log(n)^2 comparators, so I guess that should be the complexity. Not n·log(n), but not too shabby, and the main attraction is the parallelism anyway.

By eyeballing the diagram, or unrolling these two stages, we have this:

    first_stage(data, 1)

    first_stage(data, 2)
    second_stage(data, 1)

    first_stage(data, 3)
    second_stage(data, 2)
    second_stage(data, 1)

    first_stage(data, 4)
    second_stage(data, 3)
    second_stage(data, 2)
    second_stage(data, 1)

So it's log(n) of these larger blocks, and they grow linearly, making O(log(n)^2) function calls. Each of these is linear, so O(n·log(n)^2). Do you agree? Notice each stage is linear because, although it's two for-loops, the intervals are such that it's actually a linear traversal. We might even reshape the input to implement this.

[Member] I agree that that is what the runtime should be, but I think there is an error in the implementation that results in much larger runtimes.

[Contributor Author (nlw0), Oct 14, 2022] Oops, you're right! There was an extra outer for-loop in the second stage... Hopefully that helps a bit, but I'm still skeptical this is as good as it gets. I mean, I'm not even sure it will really be worth it in the end, but I'm not sure we're at the best for this algorithm yet. (A corrected sketch follows the function below.)

                    ia = i + j
                    ib = i + j + gap >> 1
                    @inbounds comparator!(v, ia + 1, ib + 1, o)
                end
            end
        end
    end
end
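
Per the thread above, a corrected single-gap second stage (an editorial sketch with the outer loop removed and the inner bound tightened; not the exact code from later commits) might read:

    function bitonicsecondstage_sketch!(v, gapexp::Int, o::Ordering)
        N = length(v)
        gap = 1 << gapexp
        g2 = gap >> 1
        @inbounds for i in 0:gap:N-1
            # stay inside both the current block and the array bounds
            lastj = min(g2 - 1, N - g2 - i - 1)
            for j in 0:lastj
                comparator!(v, i + j + 1, i + j + g2 + 1, o)
            end
        end
        return v
    end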

# ceil(log2(n)) for n > 1 (the number of bitonic stages needed for n elements), else 0.
intlog2(n) = (n > 1) ? 8sizeof(n-1)-leading_zeros(n-1) : 0

# Compare-exchange: put the smaller of v[i], v[j] (per ordering o) first, in place.
# The branch-free ternary form tends to vectorize well, which is what the network relies on.
Base.@propagate_inbounds function comparator!(v, i, j, o)
    a, b = v[i], v[j]
    v[i], v[j] = lt(o, b, a) ? (b, a) : (a, b)
end

end # module
6 changes: 3 additions & 3 deletions test/runtests.jl
@@ -5,7 +5,7 @@ using Random

a = rand(1:10000, 1000)

-for alg in [TimSort, HeapSort, RadixSort, CombSort]
+for alg in [TimSort, HeapSort, RadixSort, CombSort, BitonicSort]
    b = sort(a, alg=alg)
    @test issorted(b)
    ix = sortperm(a, alg=alg)
@@ -85,7 +85,7 @@ for n in [0:10..., 100, 101, 1000, 1001]
end

# unstable algorithms
-for alg in [HeapSort, CombSort]
+for alg in [HeapSort, CombSort, BitonicSort]
    p = sortperm(v, alg=alg, order=ord)
    @test isperm(p)
    @test v[p] == si
@@ -99,7 +99,7 @@

v = randn_with_nans(n,0.1)
for ord in [Base.Order.Forward, Base.Order.Reverse],
-    alg in [TimSort, HeapSort, RadixSort, CombSort]
+    alg in [TimSort, HeapSort, RadixSort, CombSort, BitonicSort]
    # test float sorting with NaNs
    s = sort(v, alg=alg, order=ord)
    @test issorted(s, order=ord)