Discussion #1

Open

mlochbaum opened this issue Mar 3, 2022 · 14 comments

@mlochbaum

Moving crumsort-related commentary from here (for anyone following along, this issue has some previous discussion).

I got distcrum to maintain candidates so it can augment them from one step to the next. The switch to a more accurate count with a square root approximation was a significant speedup, and even improves on median_of_twentyfive (so I removed it; maybe a median of 15 could still be useful). After that the candidate-maintaining version is about the same, but only because I'm stuck sorting all the candidates instead of sorting the new ones and then merging. The commented-out blit_merge_block call, with cnt rather than cnt + npiv in the quadsort before it, causes segfaults. partial_backward_merge doesn't seem to have the same problem, but of course with large enough arrays it tries to use more swap space than there is and also segfaults.
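
Roughly, the kind of thing I mean by a square root approximation for the candidate count (a quick sketch, not the actual distcrum code; the name is made up):

```c
/* Sketch only: approximate sqrt(n) for use as a candidate count. Start from
   a power-of-two guess, then refine with two Newton steps, which is plenty
   for a rough count. */
static size_t candidate_count(size_t n)
{
	size_t x = 1;

	if (n < 2)
		return n;

	while (x * x < n)      /* smallest power of two with x * x >= n */
		x *= 2;

	x = (x + n / x) / 2;   /* Newton refinements */
	x = (x + n / x) / 2;

	return x;
}
```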

@scandum
Owner

scandum commented Mar 4, 2022

I think blit/tail merging requires a block size of 32, 128, 512, etc. 8 might work, though I haven't tried it. Quadsort so far seems to be the fastest comparison sort for 8 / 16 / 32 / 128 elements, so 16 / 64 / 256 merges might be optimal. 512+ is probably pointless due to diminishing returns and cache pollution.

I assume you're doing something similar to what gridsort is doing, and as far as I know it is highly optimal.

@scandum
Owner

scandum commented Mar 4, 2022

I figured I'd double-check my benchmark, and you're correct about the pseudomedian of 25 not being great.

|       Name |    Items | Type |     Best |  Average |  Loops | Samp | Median | Failure |   Comps |     Distribution |
| ---------- | -------- | ---- | -------- | -------- | ------ | ---- | ------ | ------- | ------- | ---------------- |
|   8 median |     2000 |   32 | 0.000135 | 0.000138 |   2000 | 1000 |   35.2 |      21 |    21.2 |           random |
|   9 pseudo |     2000 |   32 | 0.000092 | 0.000100 |   2000 | 1000 |   33.1 |      65 |    12.0 |           random |
|  16 median |     2000 |   32 | 0.000302 | 0.000328 |   2000 | 1000 |   37.0 |       4 |    54.0 |           random |
|  25 pseudo |     2000 |   32 | 0.000365 | 0.000407 |   2000 | 1000 |   36.1 |       1 |    42.0 |           random |

Looks like that slipped by me because the median of sixteen routine was actually checking the median of 17.

It might even be worth it to switch to median of 8.

@mlochbaum
Author

I rounded up the number of new candidates to a power of two if there's not enough swap space for a normal merge, and everything seems to work now. There's a definite but small improvement. Sticking with power-of-two sizes, you'd have the option of doubling every two recursions, since two levels of partitioning cut the number of existing candidates by a factor of four but the number wanted by only two.
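
The rounding itself is just the usual bit trick, something like this (a sketch, not a particular function from the code):

```c
/* Round cnt up to the next power of two (standard bit trick, assumes
   32-bit unsigned and cnt > 0). */
static unsigned int round_up_pow2(unsigned int cnt)
{
	cnt--;
	cnt |= cnt >> 1;
	cnt |= cnt >> 2;
	cnt |= cnt >> 4;
	cnt |= cnt >> 8;
	cnt |= cnt >> 16;

	return cnt + 1;
}
```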

@mlochbaum
Author

Comparing to older versions, I find that after some tuning it's about the same as the version from before I started saving candidates. Both are definitely better than the version I started with, although I think a lot of that is due to the poor performance of median of 25. But the candidate-saving version uses a factor of √2 more candidates, and the larger sample means it can do better statistics on the array, so maybe it's still useful.

@mlochbaum
Author

In a crazy coincidence, it seems like √n is pretty much exactly the number of candidates needed to distinguish a bad input from a uniform random one in Robin Hood sorting (related to the birthday paradox, I think). The exact number changes based on a lot of things, but it's usually equal to or slightly more than √n. And there's a definite threshold: within a factor of two, the error rate goes from something like 15% each for false positives and false negatives down to 1%. Fortunately, the cost of sampling more candidates doesn't seem too bad, and increasing the number to √n isn't noticeably worse.

The statistic that works best, as far as I can tell, is to look at the differences between adjacent candidates after sorting. There's a problem when these land within 16 buffer slots of each other, so if the distance is less than 16, add 16 - distance to a running total. I still need to do some work to figure out what the cutoff for that total should be as a function of length. I think this is a good outcome because it seems there won't be any undetectable bad distributions (just bad luck in selection, but quicksort already has that problem), although the precision required to get the test right is annoying.
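
In code the statistic looks roughly like this (names are made up, and it assumes the sorted candidates have already been mapped to their Robin Hood buffer slots):

```c
/* Sketch of the gap statistic: sum how far each adjacent pair of sorted
   candidates falls short of a 16-slot spacing. A large total means the
   candidates cluster, which would overload the Robin Hood buffer. The
   cutoff to compare the total against still depends on the array length. */
static size_t candidate_gap_penalty(const size_t *slot, size_t ncand)
{
	size_t i, penalty = 0;

	for (i = 1; i < ncand; i++)
	{
		size_t gap = slot[i] - slot[i - 1];   /* distance in buffer slots */

		if (gap < 16)
			penalty += 16 - gap;
	}
	return penalty;
}
```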

@mlochbaum
Author

Did more work with the RH criterion. The candidate scanning does seem to work well, but it obviously has no power when there's only one candidate selected by pseudomedian of 9. And with limited memory, RH can't be applied immediately and partitioning pushes things down to the median of 9 case. For the measurement below, I just ran RH on any >64-element partition with only one candidate. I think there are probably reasonably quick tests that can be performed though, so the performance shown would be achievable in practice without significant risk of a worst case. Another possibility, if the array is partitioned, is to skip the test initially but disable RH or increase the testing requirement if it drops too many elements from the buffer.

Top line here is with RH disabled completely, and the "n+distcrumsort" one enables it and uses a buffer of length n up to a maximum of 2^18. The slowdown relative to plain RH is mainly due to having to partition to get the memory requirement down, and some other cost. Note that even for the default CRUM_AUX memory, the RH base case is a significant improvement (2-5ns/v) over not using it.

(benchmark plot: rand distribution)

@scandum
Owner

scandum commented Mar 9, 2022

That's looking pretty good.

It appears median of 25 is still beneficial to fluxsort, so I think most of the median-of-16 gain for crumsort comes from being able to save part of the progress of the 16-element sort.

@mlochbaum
Author

Wrote this section on techniques with candidate selection. The last paragraph discusses choosing merging rather than partitioning for one step, which is a new idea (to me at least). It's complementary to the idea of counting comparison streaks, in that streaks only make initial passes fast and candidate structure indicates that final passes will be fast. Potentially if the candidates are sorted enough you could just decide to jump straight to quadsort. Or first try pdqsort's approach of an insertion sort that only handles a few out-of-place elements before quitting.

I only described using a 2-way merge there because I'm not sure how well it can be adapted to a 4-way merge if that comes with a requirement for power of two sizes.
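
As a concrete version of the "jump straight to quadsort" idea, the check could be as simple as counting out-of-order adjacent pairs among the candidates in array order (sketch only; the names and the threshold are arbitrary):

```c
/* Sketch: if almost no adjacent candidate pairs are out of order, the array
   is probably close to sorted, and a merge pass (quadsort) may beat another
   round of partitioning. */
static int candidates_mostly_sorted(int *cand, size_t ncand,
                                    int (*cmp)(const void *, const void *))
{
	size_t i, inversions = 0;

	for (i = 1; i < ncand; i++)
	{
		if (cmp(cand + i - 1, cand + i) > 0)
			inversions++;
	}
	return inversions * 16 < ncand;   /* fewer than ~6% out-of-order pairs */
}
```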

Saving candidates in fluxsort is also possible. In fact, you can just use the end of swap space for it. Partitioning will only overwrite that space if it's very unbalanced, and in that case you clearly wouldn't want to save the pivots, and will be switching to quadsort anyway with the current code. On a tangent, the unbalanced partition requirement could be tightened a lot for these large candidate sets, but I don't know by how much.

Another thing to investigate would be a partitioning method that's best at moving blocks of elements. With vector instructions it's easy: compare a vector register or two, and have a fast path that moves those as one unit if the results are all the same. I don't know what the best way would be in plain C, and this doesn't seem like it would be as reliable to detect from pivot candidates.
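
To illustrate the fast path (AVX2 used purely as an example, 32-bit elements; this isn't crumsort code and the names are made up):

```c
#include <immintrin.h>

/* Compare a block of 8 elements against the pivot at once; if the mask is
   all zeros or all ones the block is copied to one side as a unit,
   otherwise fall back to a scalar loop. pta/ptb are the two output
   pointers and are advanced in place. */
static void partition_block8(const int *src, int pivot, int **pta, int **ptb)
{
	__m256i val = _mm256_loadu_si256((const __m256i *) src);
	__m256i piv = _mm256_set1_epi32(pivot);
	int mask = _mm256_movemask_ps(_mm256_castsi256_ps(_mm256_cmpgt_epi32(val, piv)));

	if (mask == 0)                /* all 8 elements <= pivot */
	{
		_mm256_storeu_si256((__m256i *) *pta, val);
		*pta += 8;
	}
	else if (mask == 0xFF)        /* all 8 elements > pivot */
	{
		_mm256_storeu_si256((__m256i *) *ptb, val);
		*ptb += 8;
	}
	else                          /* mixed block: scalar fallback */
	{
		int i;

		for (i = 0; i < 8; i++)
		{
			if (src[i] > pivot)
				*(*ptb)++ = src[i];
			else
				*(*pta)++ = src[i];
		}
	}
}
```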

@matu3ba

matu3ba commented Apr 23, 2022

@scandum

1. Can you please document in the README how to edit the benchmarks? The comment `// edit *sorts[] in main() to add / remove sorts` is annoying to search for.
2. Providing the curl commands would also be nicer, e.g. `curl https://raw.githubusercontent.com/orlp/pdqsort/master/pdqsort.h > pdqsort.h`

@scandum
Owner

scandum commented Apr 23, 2022

I'll change that to a DEFINE at the top of benchmark.c so it's easier to edit. Good idea regarding curl.
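
Something along these lines, purely as an illustration (the name and format aren't final):

```c
// edit this list to add / remove sorts from the benchmark
#define SORT_LIST "qsort quadsort fluxsort crumsort pdqsort"
```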

Might be a bit before it travels downstream.

@matu3ba

matu3ba commented Jun 4, 2022

While not portable per se, one can also use SIMD instructions for sorting:
https://opensource.googleblog.com/2022/06/Vectorized%20and%20performance%20portable%20Quicksort.html

tl;dr:
"Happily, modern instruction sets (Arm SVE, RISC-V V, x86 AVX-512) include a special instruction suitable for partitioning."
"Our implementation uses Highway's portable SIMD functions, so we do not have to re-implement about 3,000 lines of C++ for each platform."
Sorting performance is around 1 GB/s.

@scandum
Owner

scandum commented May 28, 2023

@mlochbaum

I noticed you had an interest in Rust, and I came across this Rust port recently. https://github.com/google/crumsort-rs

Might be interesting to benchmark it to see how it performs. The main focus is random 64-128 bit data, so it omits the analyzer.

@mlochbaum
Author

Oh, I can barely stand Rust. At least they have a benchmark I can run, but I'm getting wildly different numbers from theirs: ~80% CPU usage on 2 cores / 4 threads. For u32, the throughput at 4096 looks identical to single-core fluxsort, and the largest size hits just about twice that. Looks like all it's doing is sorting the two partitions in parallel, which isn't terribly interesting.

u32
┌──────────┬───────────┬──────────────┬─────────────┐
│   length │ algorithm │   throughput │ improvement │
├══════════┼═══════════┼══════════════┼═════════════┤
│     4096 │  pdqsort  │ 29.04Mkeys/s │       0.00% │
│     4096 │ crumsort  │ 50.84Mkeys/s │      75.08% │
│    65536 │  pdqsort  │ 52.63Mkeys/s │       0.00% │
│    65536 │ crumsort  │ 61.99Mkeys/s │      17.77% │
│  1048576 │  pdqsort  │ 48.27Mkeys/s │       0.00% │
│  1048576 │ crumsort  │ 55.38Mkeys/s │      14.73% │
│ 16777216 │  pdqsort  │ 42.71Mkeys/s │       0.00% │
│ 16777216 │ crumsort  │ 48.45Mkeys/s │      13.44% │
└──────────┴───────────┴──────────────┴─────────────┘


u64
┌──────────┬───────────┬──────────────┬─────────────┐
│   length │ algorithm │   throughput │ improvement │
├══════════┼═══════════┼══════════════┼═════════════┤
│     4096 │  pdqsort  │ 26.63Mkeys/s │       0.00% │
│     4096 │ crumsort  │ 44.17Mkeys/s │      65.88% │
│    65536 │  pdqsort  │ 50.92Mkeys/s │       0.00% │
│    65536 │ crumsort  │ 58.45Mkeys/s │      14.79% │
│  1048576 │  pdqsort  │ 47.26Mkeys/s │       0.00% │
│  1048576 │ crumsort  │ 49.38Mkeys/s │       4.49% │
│ 16777216 │  pdqsort  │ 39.74Mkeys/s │       0.00% │
│ 16777216 │ crumsort  │ 38.20Mkeys/s │      -3.89% │
└──────────┴───────────┴──────────────┴─────────────┘

@scandum
Owner

scandum commented May 28, 2023

I don't quite get the Rust hype myself, probably because I love writing unsafe code.

Makes sense that parallel only does well on larger distributions. It's used by https://github.com/google/forma, so the optimizations are likely very specific.
