sort: Fast path for already sorted data #29675

bluss · 2015-11-07T16:49:37Z

sort: Fast path for already sorted data

When merging two sorted blocks left and right if the last element in
left is <= the first in right, the blocks are already in sorted order.

Add this as an additional fast path by simply copying the whole left
block into the output and advancing the left pointer. The right block is
then treated the same way by the already present logic in the merge
loop.

Can reduce runtime of .sort() to less than 50% of the previous, if the data
was already perfectly sorted. Sorted data with a few swaps are also
sorted quicker than before. The overhead of one comparison per merge
seems to be negligible.

When merging two sorted blocks `left` and `right` if the last element in `left` is <= the first in `right`, the blocks are already sorted. Add this as an additional fast path by simply copying the whole left block into the output and advancing the left pointer. The right block is then treated the same way by the already present logic in the merge loop. Reduces runtime of .sort() to less than 50% of the previous, if the data was already perfectly sorted. Sorted data with a few swaps are also sorted quicker than before. The overhead of one comparison per merge seems to be negligible.

rust-highfive · 2015-11-07T16:49:43Z

r? @aturon

(rust_highfive has picked a reviewer for you, use r? to override)

bluss · 2015-11-07T16:53:22Z

Out of tree benchmarks look like this:

“t” is for the number of elements in a random i32 vector,

libstdsort is a plain clone + sort of the random vector, the libstd_sorted variants are clone + sort of the same vector but presorted.

I'm comparing my nightly rustc vs the patched rustc built from the tree. This results in a strange effect for the 10, 100 element case in the “after” build: what's happening is that an unoptimized)(?) jemalloc overhead dominates. Looks like a nice improvement when taking that into account.

before:

test t10000::libstd_sorted          ... bench:     429,275 ns/iter (+/- 1,836)
test t10000::libstd_sorted_10swaps  ... bench:     511,796 ns/iter (+/- 62,528)
test t10000::libstd_sorted_100swaps ... bench:     606,273 ns/iter (+/- 943)
test t10000::libstdsort             ... bench:     996,085 ns/iter (+/- 1,115)
test t1000::libstd_sorted           ... bench:      26,033 ns/iter (+/- 765)
test t1000::libstd_sorted_10swaps   ... bench:      35,167 ns/iter (+/- 507)
test t1000::libstd_sorted_100swaps  ... bench:      45,044 ns/iter (+/- 206)
test t1000::libstdsort              ... bench:      65,519 ns/iter (+/- 221)
test t100::libstd_sorted            ... bench:       1,842 ns/iter (+/- 9)
test t100::libstd_sorted_10swaps    ... bench:       2,675 ns/iter (+/- 20)
test t100::libstdsort               ... bench:       3,999 ns/iter (+/- 24)
test t10::libstd_sorted             ... bench:          74 ns/iter (+/- 1)
test t10::libstdsort                ... bench:         132 ns/iter (+/- 1)

after:

test t10000::libstd_sorted          ... bench:     175,627 ns/iter (+/- 747)
test t10000::libstd_sorted_10swaps  ... bench:     406,447 ns/iter (+/- 803)
test t10000::libstd_sorted_100swaps ... bench:     587,881 ns/iter (+/- 885)
test t10000::libstdsort             ... bench:     984,554 ns/iter (+/- 1,908)
test t1000::libstd_sorted           ... bench:      13,669 ns/iter (+/- 10)
test t1000::libstd_sorted_10swaps   ... bench:      34,950 ns/iter (+/- 90)
test t1000::libstd_sorted_100swaps  ... bench:      46,550 ns/iter (+/- 113)
test t1000::libstdsort              ... bench:      66,939 ns/iter (+/- 283)
test t100::libstd_sorted            ... bench:       5,450 ns/iter (+/- 10)
test t100::libstd_sorted_10swaps    ... bench:       6,798 ns/iter (+/- 9)
test t100::libstdsort               ... bench:       8,114 ns/iter (+/- 16)
test t10::libstd_sorted             ... bench:       2,204 ns/iter (+/- 4)
test t10::libstdsort                ... bench:       2,266 ns/iter (+/- 3)

If someone wants to run the in-tree benchmarks, I'd be thrilled to see that.

bluss · 2015-11-07T17:04:34Z

src/libcollections/slice.rs

+                // if left[last] <= right[0], they are already in order:
+                // fast-forward the left side (the right side is handled
+                // in the loop).
+                if compare(&*right.offset(-1), &*right) != Greater {


This needs a bounds check actually.

The right part must not be empty.

Gankra · 2015-11-07T17:27:17Z

Great wins on big runs! 👍

I'm a bit confused with what's happening in the small input cases. You mention jemalloc but I don't follow why that would have changed between runs?

bluss · 2015-11-07T17:30:37Z

I'm comparing my nightly rustc vs the patched rustc built from the tree.

It's this.

huonw · 2015-11-07T23:11:51Z

I'm comparing my nightly rustc vs the patched rustc built from the tree.

It's this.

I don't understand why that's relevant? If optimisations are enabled, they should be generating essentially the same code?

huonw · 2015-11-07T23:17:03Z

Oh, I see, if the patched rustc was built without optimisations, then jemalloc may've been too, and hence runs slower... This is somewhat suspicious, so r=me, except for nailing down the exact cause of the regression (and/or getting a version of the benchmarks that shows it doesn't exist for a "real" build).

bluss · 2015-11-13T01:02:16Z

Ok, I ran the in tree benchmarks:

before

slice::bench::sort_big_random_large     ... bench:   4,749,545 ns/iter (+/- 948,440) = 67 MB/s
slice::bench::sort_big_random_medium    ... bench:      36,870 ns/iter (+/- 812) = 86 MB/s
slice::bench::sort_big_random_small     ... bench:       1,798 ns/iter (+/- 100) = 88 MB/s
slice::bench::sort_big_sorted           ... bench:     772,643 ns/iter (+/- 11,515) = 414 MB/s
slice::bench::sort_random_large         ... bench:   1,806,359 ns/iter (+/- 36,855) = 44 MB/s
slice::bench::sort_random_medium        ... bench:      14,509 ns/iter (+/- 198) = 55 MB/s
slice::bench::sort_random_small         ... bench:         685 ns/iter (+/- 20) = 58 MB/s
slice::bench::sort_sorted               ... bench:     397,642 ns/iter (+/- 1,082) = 100 MB/s

after

slice::bench::sort_big_random_large     ... bench:   4,718,159 ns/iter (+/- 41,542) = 67 MB/s
slice::bench::sort_big_random_medium    ... bench:      36,909 ns/iter (+/- 292) = 86 MB/s
slice::bench::sort_big_random_small     ... bench:       1,888 ns/iter (+/- 13) = 84 MB/s
slice::bench::sort_big_sorted           ... bench:     615,455 ns/iter (+/- 8,196) = 519 MB/s
slice::bench::sort_random_large         ... bench:   1,796,680 ns/iter (+/- 13,319) = 44 MB/s
slice::bench::sort_random_medium        ... bench:      14,588 ns/iter (+/- 352) = 54 MB/s
slice::bench::sort_random_small         ... bench:         699 ns/iter (+/- 9) = 57 MB/s
slice::bench::sort_sorted               ... bench:     134,979 ns/iter (+/- 1,101) = 296 MB/s

Which I think looks good?

bluss · 2015-11-13T01:06:15Z

The "big" cases sort elements of type BigSortable = (u64, u64, u64, u64);, small cases u64, which makes a surprising difference.

huonw · 2015-11-13T01:21:22Z

@bors r+

bors · 2015-11-13T01:21:23Z

📌 Commit 0f5e30d has been approved by huonw

sort: Fast path for already sorted data When merging two sorted blocks `left` and `right` if the last element in `left` is <= the first in `right`, the blocks are already in sorted order. Add this as an additional fast path by simply copying the whole left block into the output and advancing the left pointer. The right block is then treated the same way by the already present logic in the merge loop. Can reduce runtime of .sort() to less than 50% of the previous, if the data was already perfectly sorted. Sorted data with a few swaps are also sorted quicker than before. The overhead of one comparison per merge seems to be negligible.

bors · 2015-11-13T01:44:59Z

⌛ Testing commit 0f5e30d with merge 8a813e0...

Gankra · 2015-11-13T03:15:35Z

@bluss thanks for following up! 👍

bors · 2015-11-13T05:56:49Z

☀️ Test successful - auto-linux-32-nopt-t, auto-linux-32-opt, auto-linux-64-debug-opt, auto-linux-64-nopt-t, auto-linux-64-opt, auto-linux-64-x-android-t, auto-linux-cross-opt, auto-linux-musl-64-opt, auto-mac-32-opt, auto-mac-64-nopt-t, auto-mac-64-opt, auto-win-gnu-32-nopt-t, auto-win-gnu-32-opt, auto-win-gnu-64-nopt-t, auto-win-gnu-64-opt, auto-win-msvc-32-opt, auto-win-msvc-64-opt

@huonw

Implement a faster sort algorithm Hi everyone, this is my first PR. I've made some changes to the standard sort algorithm, starting out with a few tweaks here and there, but in the end this endeavour became a complete rewrite of it. #### Summary Changes: * Improved performance, especially on partially sorted inputs. * Performs less comparisons on both random and partially sorted inputs. * Decreased the size of temporary memory: the new sort allocates 4x less. Benchmark: ``` name out1 ns/iter out2 ns/iter diff ns/iter diff % slice::bench::sort_large_ascending 85,323 (937 MB/s) 8,970 (8918 MB/s) -76,353 -89.49% slice::bench::sort_large_big_ascending 2,135,297 (599 MB/s) 355,955 (3595 MB/s) -1,779,342 -83.33% slice::bench::sort_large_big_descending 2,266,402 (564 MB/s) 416,479 (3073 MB/s) -1,849,923 -81.62% slice::bench::sort_large_big_random 3,053,031 (419 MB/s) 1,921,389 (666 MB/s) -1,131,642 -37.07% slice::bench::sort_large_descending 313,181 (255 MB/s) 14,725 (5432 MB/s) -298,456 -95.30% slice::bench::sort_large_mostly_ascending 287,706 (278 MB/s) 243,204 (328 MB/s) -44,502 -15.47% slice::bench::sort_large_mostly_descending 415,078 (192 MB/s) 271,028 (295 MB/s) -144,050 -34.70% slice::bench::sort_large_random 545,872 (146 MB/s) 521,559 (153 MB/s) -24,313 -4.45% slice::bench::sort_large_random_expensive 30,321,770 (2 MB/s) 23,533,735 (3 MB/s) -6,788,035 -22.39% slice::bench::sort_medium_ascending 616 (1298 MB/s) 155 (5161 MB/s) -461 -74.84% slice::bench::sort_medium_descending 1,952 (409 MB/s) 202 (3960 MB/s) -1,750 -89.65% slice::bench::sort_medium_random 3,646 (219 MB/s) 3,421 (233 MB/s) -225 -6.17% slice::bench::sort_small_ascending 39 (2051 MB/s) 34 (2352 MB/s) -5 -12.82% slice::bench::sort_small_big_ascending 96 (13333 MB/s) 96 (13333 MB/s) 0 0.00% slice::bench::sort_small_big_descending 248 (5161 MB/s) 243 (5267 MB/s) -5 -2.02% slice::bench::sort_small_big_random 501 (2554 MB/s) 490 (2612 MB/s) -11 -2.20% slice::bench::sort_small_descending 95 (842 MB/s) 63 (1269 MB/s) -32 -33.68% slice::bench::sort_small_random 372 (215 MB/s) 354 (225 MB/s) -18 -4.84% ``` #### Background First, let me just do a quick brain dump to discuss what I learned along the way. The official documentation says that the standard sort in Rust is a stable sort. This constraint is thus set in stone and immediately rules out many popular sorting algorithms. Essentially, the only algorithms we might even take into consideration are: 1. [Merge sort](https://en.wikipedia.org/wiki/Merge_sort) 2. [Block sort](https://en.wikipedia.org/wiki/Block_sort) (famous implementations are [WikiSort](https://github.com/BonzaiThePenguin/WikiSort) and [GrailSort](https://github.com/Mrrl/GrailSort)) 3. [TimSort](https://en.wikipedia.org/wiki/Timsort) Actually, all of those are just merge sort flavors. :) The current standard sort in Rust is a simple iterative merge sort. It has three problems. First, it's slow on partially sorted inputs (even though #29675 helped quite a bit). Second, it always makes around `log(n)` iterations copying the entire array between buffers, no matter what. Third, it allocates huge amounts of temporary memory (a buffer of size `2*n`, where `n` is the size of input). The problem of auxilliary memory allocation is a tough one. Ideally, it would be best for our sort to allocate `O(1)` additional memory. This is what block sort (and it's variants) does. However, it's often very complicated (look at [this](https://github.com/BonzaiThePenguin/WikiSort/blob/master/WikiSort.cpp)) and even then performs rather poorly. The author of WikiSort claims good performance, but that must be taken with a grain of salt. It performs well in comparison to `std::stable_sort` in C++. It can even beat `std::sort` on partially sorted inputs, but on random inputs it's always far worse. My rule of thumb is: high performance, low memory overhead, stability - choose two. TimSort is another option. It allocates a buffer of size `n/2`, which is not great, but acceptable. Performs extremelly well on partially sorted inputs. However, it seems pretty much all implementations suck on random inputs. I benchmarked implementations in [Rust](https://github.com/notriddle/rust-timsort), [C++](https://github.com/gfx/cpp-TimSort), and [D](https://github.com/dlang/phobos/blob/fd518eb310a9494cccf28c54892542b052c49669/std/algorithm/sorting.d#L2062). The results were a bit disappointing. It seems bad performance is due to complex galloping procedures in hot loops. Galloping noticeably improves performance on partially sorted inputs, but worsens it on random ones. #### The new algorithm Choosing the best algorithm is not easy. Plain merge sort is bad on partially sorted inputs. TimSort is bad on random inputs and block sort is even worse. However, if we take the main ideas from TimSort (intelligent merging strategy of sorted runs) and drop galloping, then we'll have great performance on random inputs and it won't be bad on partially sorted inputs either. That is exactly what this new algorithm does. I can't call it TimSort, since it steals just a few of it's ideas. Complete TimSort would be a much more complex and elaborate implementation. In case we in the future figure out how to incorporate more of it's ideas into this implementation without crippling performance on random inputs, it's going to be very easy to extend. I also did several other minor improvements, like reworked insertion sort to make it faster. There are also new, more thorough benchmarks and panic safety tests. The final code is not terribly complex and has less unsafe code than I anticipated, but there's still plenty of it that should be carefully reviewed. I did my best at documenting non-obvious code. I'd like to notify several people of this PR, since they might be interested and have useful insights: 1. @huonw because he wrote the [original merge sort](#11064). 2. @alexcrichton because he was involved in multiple discussions of it. 3. @veddan because he wrote [introsort](https://github.com/veddan/rust-introsort) in Rust. 4. @notriddle because he wrote [TimSort](https://github.com/notriddle/rust-timsort) in Rust. 5. @bluss because he had an attempt at writing WikiSort in Rust. 6. @gnzlbg, @rkruppe, and @mark-i-m because they were involved in discussion #36318. **P.S.** [quickersort](https://github.com/notriddle/quickersort) describes itself as being universally [faster](https://github.com/notriddle/quickersort/blob/master/perf.txt) than the standard sort, which is true. However, if this PR gets merged, things might [change](https://gist.github.com/stjepang/b9f0c3eaa0e1f1280b61b963dae19a30) a bit. ;)

rust-highfive assigned aturon Nov 7, 2015

bluss reviewed Nov 7, 2015
View reviewed changes

sort: Guard the fast path by length check

0f5e30d

The right part must not be empty.

bors merged commit 0f5e30d into rust-lang:master Nov 13, 2015

bluss deleted the merge-sort-fastpath branch November 13, 2015 12:10

alexcrichton added the relnotes Marks issues that should be documented in the release notes of the next release. label Nov 16, 2015

ghost mentioned this pull request Dec 6, 2016

Implement a faster sort algorithm #38192

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sort: Fast path for already sorted data #29675

sort: Fast path for already sorted data #29675

bluss commented Nov 7, 2015

rust-highfive commented Nov 7, 2015

bluss commented Nov 7, 2015

bluss Nov 7, 2015

Gankra commented Nov 7, 2015

bluss commented Nov 7, 2015

huonw commented Nov 7, 2015

huonw commented Nov 7, 2015

bluss commented Nov 13, 2015

bluss commented Nov 13, 2015

huonw commented Nov 13, 2015

bors commented Nov 13, 2015

bors commented Nov 13, 2015

Gankra commented Nov 13, 2015

bors commented Nov 13, 2015

sort: Fast path for already sorted data #29675

sort: Fast path for already sorted data #29675

Conversation

bluss commented Nov 7, 2015

rust-highfive commented Nov 7, 2015

bluss commented Nov 7, 2015

bluss Nov 7, 2015

Choose a reason for hiding this comment

Gankra commented Nov 7, 2015

bluss commented Nov 7, 2015

huonw commented Nov 7, 2015

huonw commented Nov 7, 2015

bluss commented Nov 13, 2015

bluss commented Nov 13, 2015

huonw commented Nov 13, 2015

bors commented Nov 13, 2015

bors commented Nov 13, 2015

Gankra commented Nov 13, 2015

bors commented Nov 13, 2015