Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slightly optimize slice::sort #39538

Merged
merged 2 commits into from Feb 5, 2017
Merged

Slightly optimize slice::sort #39538

merged 2 commits into from Feb 5, 2017

Conversation

ghost
Copy link

@ghost ghost commented Feb 4, 2017

First, get rid of some bound checks.

Second, instead of comparing by ternary compare function, use a binary function testing whether an element is less than some other element. This apparently makes it easier for the compiler to reason about the code. I've noticed the same effect with pdqsort crate.

Benchmark:

name                                        before ns/iter        after ns/iter         diff ns/iter   diff %
slice::bench::sort_large_ascending          8,969 (8919 MB/s)     7,410 (10796 MB/s)          -1,559  -17.38%
slice::bench::sort_large_big_ascending      355,640 (3599 MB/s)   359,137 (3564 MB/s)          3,497    0.98%
slice::bench::sort_large_big_descending     427,112 (2996 MB/s)   424,721 (3013 MB/s)         -2,391   -0.56%
slice::bench::sort_large_big_random         2,207,799 (579 MB/s)  2,138,804 (598 MB/s)       -68,995   -3.13%
slice::bench::sort_large_descending         13,694 (5841 MB/s)    13,514 (5919 MB/s)            -180   -1.31%
slice::bench::sort_large_mostly_ascending   239,697 (333 MB/s)    203,542 (393 MB/s)         -36,155  -15.08%
slice::bench::sort_large_mostly_descending  270,102 (296 MB/s)    234,263 (341 MB/s)         -35,839  -13.27%
slice::bench::sort_large_random             513,406 (155 MB/s)    470,084 (170 MB/s)         -43,322   -8.44%
slice::bench::sort_large_random_expensive   23,650,321 (3 MB/s)   23,675,098 (3 MB/s)         24,777    0.10%
slice::bench::sort_medium_ascending         143 (5594 MB/s)       132 (6060 MB/s)                -11   -7.69%
slice::bench::sort_medium_descending        197 (4060 MB/s)       188 (4255 MB/s)                 -9   -4.57%
slice::bench::sort_medium_random            3,358 (238 MB/s)      3,271 (244 MB/s)               -87   -2.59%
slice::bench::sort_small_ascending          32 (2500 MB/s)        32 (2500 MB/s)                   0    0.00%
slice::bench::sort_small_big_ascending      97 (13195 MB/s)       97 (13195 MB/s)                  0    0.00%
slice::bench::sort_small_big_descending     247 (5182 MB/s)       249 (5140 MB/s)                  2    0.81%
slice::bench::sort_small_big_random         502 (2549 MB/s)       498 (2570 MB/s)                 -4   -0.80%
slice::bench::sort_small_descending         55 (1454 MB/s)        61 (1311 MB/s)                   6   10.91%
slice::bench::sort_small_random             358 (223 MB/s)        356 (224 MB/s)                  -2   -0.56%

@rust-highfive
Copy link
Collaborator

r? @alexcrichton

(rust_highfive has picked a reviewer for you, use r? to override)

Stjepan Glavina added 2 commits February 4, 2017 16:44
First, get rid of some bound checks.

Second, instead of comparing by ternary `compare` function, use a binary
function testing whether an element is less than some other element.
This apparently makes it easier for the compiler to reason about the
code.

Benchmark:

```
name                                        before ns/iter        after ns/iter         diff ns/iter   diff %
slice::bench::sort_large_ascending          8,969 (8919 MB/s)     7,410 (10796 MB/s)          -1,559  -17.38%
slice::bench::sort_large_big_ascending      355,640 (3599 MB/s)   359,137 (3564 MB/s)          3,497    0.98%
slice::bench::sort_large_big_descending     427,112 (2996 MB/s)   424,721 (3013 MB/s)         -2,391   -0.56%
slice::bench::sort_large_big_random         2,207,799 (579 MB/s)  2,138,804 (598 MB/s)       -68,995   -3.13%
slice::bench::sort_large_descending         13,694 (5841 MB/s)    13,514 (5919 MB/s)            -180   -1.31%
slice::bench::sort_large_mostly_ascending   239,697 (333 MB/s)    203,542 (393 MB/s)         -36,155  -15.08%
slice::bench::sort_large_mostly_descending  270,102 (296 MB/s)    234,263 (341 MB/s)         -35,839  -13.27%
slice::bench::sort_large_random             513,406 (155 MB/s)    470,084 (170 MB/s)         -43,322   -8.44%
slice::bench::sort_large_random_expensive   23,650,321 (3 MB/s)   23,675,098 (3 MB/s)         24,777    0.10%
slice::bench::sort_medium_ascending         143 (5594 MB/s)       132 (6060 MB/s)                -11   -7.69%
slice::bench::sort_medium_descending        197 (4060 MB/s)       188 (4255 MB/s)                 -9   -4.57%
slice::bench::sort_medium_random            3,358 (238 MB/s)      3,271 (244 MB/s)               -87   -2.59%
slice::bench::sort_small_ascending          32 (2500 MB/s)        32 (2500 MB/s)                   0    0.00%
slice::bench::sort_small_big_ascending      97 (13195 MB/s)       97 (13195 MB/s)                  0    0.00%
slice::bench::sort_small_big_descending     247 (5182 MB/s)       249 (5140 MB/s)                  2    0.81%
slice::bench::sort_small_big_random         502 (2549 MB/s)       498 (2570 MB/s)                 -4   -0.80%
slice::bench::sort_small_descending         55 (1454 MB/s)        61 (1311 MB/s)                   6   10.91%
slice::bench::sort_small_random             358 (223 MB/s)        356 (224 MB/s)                  -2   -0.56%
```
Before, the `count` would be copied into the closure and could
potentially be optimized way. This change ensures it's borrowed by
closure and finally consumed by `test::black_box`.
@alexcrichton
Copy link
Member

@bors: r+

Thanks!

@bors
Copy link
Contributor

bors commented Feb 4, 2017

📌 Commit fa457bf has been approved by alexcrichton

frewsxcv added a commit to frewsxcv/rust that referenced this pull request Feb 5, 2017
…alexcrichton

Slightly optimize slice::sort

First, get rid of some bound checks.

Second, instead of comparing by ternary `compare` function, use a binary function testing whether an element is less than some other element. This apparently makes it easier for the compiler to reason about the code. I've noticed the same effect with [pdqsort](https://github.com/stjepang/pdqsort) crate.

Benchmark:

```
name                                        before ns/iter        after ns/iter         diff ns/iter   diff %
slice::bench::sort_large_ascending          8,969 (8919 MB/s)     7,410 (10796 MB/s)          -1,559  -17.38%
slice::bench::sort_large_big_ascending      355,640 (3599 MB/s)   359,137 (3564 MB/s)          3,497    0.98%
slice::bench::sort_large_big_descending     427,112 (2996 MB/s)   424,721 (3013 MB/s)         -2,391   -0.56%
slice::bench::sort_large_big_random         2,207,799 (579 MB/s)  2,138,804 (598 MB/s)       -68,995   -3.13%
slice::bench::sort_large_descending         13,694 (5841 MB/s)    13,514 (5919 MB/s)            -180   -1.31%
slice::bench::sort_large_mostly_ascending   239,697 (333 MB/s)    203,542 (393 MB/s)         -36,155  -15.08%
slice::bench::sort_large_mostly_descending  270,102 (296 MB/s)    234,263 (341 MB/s)         -35,839  -13.27%
slice::bench::sort_large_random             513,406 (155 MB/s)    470,084 (170 MB/s)         -43,322   -8.44%
slice::bench::sort_large_random_expensive   23,650,321 (3 MB/s)   23,675,098 (3 MB/s)         24,777    0.10%
slice::bench::sort_medium_ascending         143 (5594 MB/s)       132 (6060 MB/s)                -11   -7.69%
slice::bench::sort_medium_descending        197 (4060 MB/s)       188 (4255 MB/s)                 -9   -4.57%
slice::bench::sort_medium_random            3,358 (238 MB/s)      3,271 (244 MB/s)               -87   -2.59%
slice::bench::sort_small_ascending          32 (2500 MB/s)        32 (2500 MB/s)                   0    0.00%
slice::bench::sort_small_big_ascending      97 (13195 MB/s)       97 (13195 MB/s)                  0    0.00%
slice::bench::sort_small_big_descending     247 (5182 MB/s)       249 (5140 MB/s)                  2    0.81%
slice::bench::sort_small_big_random         502 (2549 MB/s)       498 (2570 MB/s)                 -4   -0.80%
slice::bench::sort_small_descending         55 (1454 MB/s)        61 (1311 MB/s)                   6   10.91%
slice::bench::sort_small_random             358 (223 MB/s)        356 (224 MB/s)                  -2   -0.56%
```
frewsxcv added a commit to frewsxcv/rust that referenced this pull request Feb 5, 2017
…alexcrichton

Slightly optimize slice::sort

First, get rid of some bound checks.

Second, instead of comparing by ternary `compare` function, use a binary function testing whether an element is less than some other element. This apparently makes it easier for the compiler to reason about the code. I've noticed the same effect with [pdqsort](https://github.com/stjepang/pdqsort) crate.

Benchmark:

```
name                                        before ns/iter        after ns/iter         diff ns/iter   diff %
slice::bench::sort_large_ascending          8,969 (8919 MB/s)     7,410 (10796 MB/s)          -1,559  -17.38%
slice::bench::sort_large_big_ascending      355,640 (3599 MB/s)   359,137 (3564 MB/s)          3,497    0.98%
slice::bench::sort_large_big_descending     427,112 (2996 MB/s)   424,721 (3013 MB/s)         -2,391   -0.56%
slice::bench::sort_large_big_random         2,207,799 (579 MB/s)  2,138,804 (598 MB/s)       -68,995   -3.13%
slice::bench::sort_large_descending         13,694 (5841 MB/s)    13,514 (5919 MB/s)            -180   -1.31%
slice::bench::sort_large_mostly_ascending   239,697 (333 MB/s)    203,542 (393 MB/s)         -36,155  -15.08%
slice::bench::sort_large_mostly_descending  270,102 (296 MB/s)    234,263 (341 MB/s)         -35,839  -13.27%
slice::bench::sort_large_random             513,406 (155 MB/s)    470,084 (170 MB/s)         -43,322   -8.44%
slice::bench::sort_large_random_expensive   23,650,321 (3 MB/s)   23,675,098 (3 MB/s)         24,777    0.10%
slice::bench::sort_medium_ascending         143 (5594 MB/s)       132 (6060 MB/s)                -11   -7.69%
slice::bench::sort_medium_descending        197 (4060 MB/s)       188 (4255 MB/s)                 -9   -4.57%
slice::bench::sort_medium_random            3,358 (238 MB/s)      3,271 (244 MB/s)               -87   -2.59%
slice::bench::sort_small_ascending          32 (2500 MB/s)        32 (2500 MB/s)                   0    0.00%
slice::bench::sort_small_big_ascending      97 (13195 MB/s)       97 (13195 MB/s)                  0    0.00%
slice::bench::sort_small_big_descending     247 (5182 MB/s)       249 (5140 MB/s)                  2    0.81%
slice::bench::sort_small_big_random         502 (2549 MB/s)       498 (2570 MB/s)                 -4   -0.80%
slice::bench::sort_small_descending         55 (1454 MB/s)        61 (1311 MB/s)                   6   10.91%
slice::bench::sort_small_random             358 (223 MB/s)        356 (224 MB/s)                  -2   -0.56%
```
frewsxcv added a commit to frewsxcv/rust that referenced this pull request Feb 5, 2017
…alexcrichton

Slightly optimize slice::sort

First, get rid of some bound checks.

Second, instead of comparing by ternary `compare` function, use a binary function testing whether an element is less than some other element. This apparently makes it easier for the compiler to reason about the code. I've noticed the same effect with [pdqsort](https://github.com/stjepang/pdqsort) crate.

Benchmark:

```
name                                        before ns/iter        after ns/iter         diff ns/iter   diff %
slice::bench::sort_large_ascending          8,969 (8919 MB/s)     7,410 (10796 MB/s)          -1,559  -17.38%
slice::bench::sort_large_big_ascending      355,640 (3599 MB/s)   359,137 (3564 MB/s)          3,497    0.98%
slice::bench::sort_large_big_descending     427,112 (2996 MB/s)   424,721 (3013 MB/s)         -2,391   -0.56%
slice::bench::sort_large_big_random         2,207,799 (579 MB/s)  2,138,804 (598 MB/s)       -68,995   -3.13%
slice::bench::sort_large_descending         13,694 (5841 MB/s)    13,514 (5919 MB/s)            -180   -1.31%
slice::bench::sort_large_mostly_ascending   239,697 (333 MB/s)    203,542 (393 MB/s)         -36,155  -15.08%
slice::bench::sort_large_mostly_descending  270,102 (296 MB/s)    234,263 (341 MB/s)         -35,839  -13.27%
slice::bench::sort_large_random             513,406 (155 MB/s)    470,084 (170 MB/s)         -43,322   -8.44%
slice::bench::sort_large_random_expensive   23,650,321 (3 MB/s)   23,675,098 (3 MB/s)         24,777    0.10%
slice::bench::sort_medium_ascending         143 (5594 MB/s)       132 (6060 MB/s)                -11   -7.69%
slice::bench::sort_medium_descending        197 (4060 MB/s)       188 (4255 MB/s)                 -9   -4.57%
slice::bench::sort_medium_random            3,358 (238 MB/s)      3,271 (244 MB/s)               -87   -2.59%
slice::bench::sort_small_ascending          32 (2500 MB/s)        32 (2500 MB/s)                   0    0.00%
slice::bench::sort_small_big_ascending      97 (13195 MB/s)       97 (13195 MB/s)                  0    0.00%
slice::bench::sort_small_big_descending     247 (5182 MB/s)       249 (5140 MB/s)                  2    0.81%
slice::bench::sort_small_big_random         502 (2549 MB/s)       498 (2570 MB/s)                 -4   -0.80%
slice::bench::sort_small_descending         55 (1454 MB/s)        61 (1311 MB/s)                   6   10.91%
slice::bench::sort_small_random             358 (223 MB/s)        356 (224 MB/s)                  -2   -0.56%
```
bors added a commit that referenced this pull request Feb 5, 2017
Rollup of 12 pull requests

- Successful merges: #39439, #39472, #39481, #39491, #39501, #39509, #39514, #39519, #39526, #39528, #39530, #39538
- Failed merges:
@bors bors merged commit fa457bf into rust-lang:master Feb 5, 2017
@ghost ghost deleted the slightly-optimize-sort branch February 5, 2017 22:23
bors added a commit that referenced this pull request Feb 11, 2017
…ichton

Specialize `PartialOrd<A> for [A] where A: Ord`

This way we can call `cmp` instead of `partial_cmp` in the loop, removing some burden of optimizing `Option`s away from the compiler.

PR #39538 introduced a regression where sorting slices suddenly became slower, since `slice1.lt(slice2)` was much slower than `slice1.cmp(slice2) == Less`. This problem is now fixed.

To verify, I benchmarked this simple program:
```rust
fn main() {
    let mut v = (0..2_000_000).map(|x| x * x * x * 18913515181).map(|x| vec![x, x ^ 3137831591]).collect::<Vec<_>>();
    v.sort();
}
```

Before this PR, it would take 0.95 sec, and now it takes 0.58 sec.
I also tried changing the `is_less` lambda to use `cmp` and `partial_cmp`. Now all three versions (`lt`, `cmp`, `partial_cmp`) are equally performant for sorting slices - all of them take 0.58 sec on the
benchmark.

Tangentially, as soon as we get `default impl`, it might be a good idea to implement a blanket default impl for `lt`, `gt`, `le`, `ge` in terms of `cmp` whenever possible. Today, those four functions by default are only implemented in terms of `partial_cmp`.

r? @alexcrichton
frewsxcv added a commit to frewsxcv/rust that referenced this pull request Feb 11, 2017
…d, r=alexcrichton

Specialize `PartialOrd<A> for [A] where A: Ord`

This way we can call `cmp` instead of `partial_cmp` in the loop, removing some burden of optimizing `Option`s away from the compiler.

PR rust-lang#39538 introduced a regression where sorting slices suddenly became slower, since `slice1.lt(slice2)` was much slower than `slice1.cmp(slice2) == Less`. This problem is now fixed.

To verify, I benchmarked this simple program:
```rust
fn main() {
    let mut v = (0..2_000_000).map(|x| x * x * x * 18913515181).map(|x| vec![x, x ^ 3137831591]).collect::<Vec<_>>();
    v.sort();
}
```

Before this PR, it would take 0.95 sec, and now it takes 0.58 sec.
I also tried changing the `is_less` lambda to use `cmp` and `partial_cmp`. Now all three versions (`lt`, `cmp`, `partial_cmp`) are equally performant for sorting slices - all of them take 0.58 sec on the
benchmark.

Tangentially, as soon as we get `default impl`, it might be a good idea to implement a blanket default impl for `lt`, `gt`, `le`, `ge` in terms of `cmp` whenever possible. Today, those four functions by default are only implemented in terms of `partial_cmp`.

r? @alexcrichton
@ghost ghost mentioned this pull request Mar 17, 2017
bors added a commit that referenced this pull request Mar 21, 2017
Implement feature sort_unstable

Tracking issue for the feature: #40585

This is essentially integration of [pdqsort](https://github.com/stjepang/pdqsort) into libcore.

There's plenty of unsafe blocks to review. The heart of pdqsort is `fn partition_in_blocks` and is probably the most challenging function to understand. It requires some patience, but let me know if you find it too difficult - comments could always be improved.

#### Changes

* Added `sort_unstable` feature.
* Tweaked insertion sort constants for stable sort. Sorting integers is now up to 5% slower, but sorting big elements is much faster (in particular, `sort_large_big_random` is 35% faster). The old constants were highly optimized for sorting integers, so overall the configuration is more balanced now. A minor regression in case of integers is forgivable as we recently had performance improvements (#39538) that completely make up for it.
* Removed some uninteresting sort benchmarks.
* Added a new sort benchmark for string sorting.

#### Benchmarks

The following table compares stable and unstable sorting:
```
name                                 stable ns/iter        unstable ns/iter     diff ns/iter   diff %
slice::sort_large_ascending          7,240 (11049 MB/s)    7,380 (10840 MB/s)            140    1.93%
slice::sort_large_big_random         1,454,138 (880 MB/s)  910,269 (1406 MB/s)      -543,869  -37.40%
slice::sort_large_descending         13,450 (5947 MB/s)    10,895 (7342 MB/s)         -2,555  -19.00%
slice::sort_large_mostly_ascending   204,041 (392 MB/s)    88,639 (902 MB/s)        -115,402  -56.56%
slice::sort_large_mostly_descending  217,109 (368 MB/s)    99,009 (808 MB/s)        -118,100  -54.40%
slice::sort_large_random             477,257 (167 MB/s)    346,028 (231 MB/s)       -131,229  -27.50%
slice::sort_large_random_expensive   21,670,537 (3 MB/s)   22,710,238 (3 MB/s)     1,039,701    4.80%
slice::sort_large_strings            6,284,499 (38 MB/s)   6,410,896 (37 MB/s)       126,397    2.01%
slice::sort_medium_random            3,515 (227 MB/s)      3,327 (240 MB/s)             -188   -5.35%
slice::sort_small_ascending          42 (1904 MB/s)        41 (1951 MB/s)                 -1   -2.38%
slice::sort_small_big_random         503 (2544 MB/s)       514 (2490 MB/s)                11    2.19%
slice::sort_small_descending         72 (1111 MB/s)        69 (1159 MB/s)                 -3   -4.17%
slice::sort_small_random             369 (216 MB/s)        367 (217 MB/s)                 -2   -0.54%
```

Interesting cases:
* Expensive comparison function and string sorting - it's a really close race, but timsort performs a slightly smaller number of comparisons. This is a natural difference of bottom-up merging versus top-down partitioning.
* `large_descending` - unstable sort is faster, but both sorts should have equivalent performance. Both just check whether the slice is descending and if so, they reverse it. I blame LLVM for the discrepancy.

r? @alexcrichton
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants