sorting networks #1

scandum · 2024-02-02T15:06:04Z

I haven't done much work on sorting lately, but figured I'd share some findings.

I looked into unstable sorting networks this week and haven't been able to reproduce the suggested performance gain. I suspect there's some instruction-cache pollution due to the large code size when utilizing sorting networks in a quicksort.

So far my best results have been using piposort with a threshold of 96, with unrolled 4, 8, and 16 element parity merges and twice-unguarded insertion to fill the gaps.
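For readers unfamiliar with the term: a parity merge relies on the two sorted runs having equal length n, so n merge steps from the front plus n merge steps from the back fill the output without any bounds checks. Below is a minimal, non-unrolled sketch of that idea. The function name parity_merge_sketch is hypothetical, and this is not the piposort or quadsort code, which unrolls the loop for fixed sizes.

```c
#include <stddef.h>

/* Hypothetical helper (not from piposort/quadsort): merge two adjacent
 * sorted runs of equal length n, src[0..n) and src[n..2n), into
 * dst[0..2n). dst must not overlap src. */
static void parity_merge_sketch(int *dst, const int *src, size_t n)
{
    size_t hl = 0, hr = n;       /* head cursors of left and right run */
    size_t tl = n, tr = 2 * n;   /* tail cursors, one past the element */
    size_t oh = 0, ot = 2 * n;   /* output cursors, front and back */

    for (size_t i = 0; i < n; i++) {
        /* Front: emit the smaller head; ties go left for stability. */
        dst[oh++] = (src[hl] <= src[hr]) ? src[hl++] : src[hr++];
        /* Back: emit the larger tail; ties go right for stability. */
        dst[--ot] = (src[tl - 1] > src[tr - 1]) ? src[--tl] : src[--tr];
    }
}
```

Because each side performs exactly n steps, no cursor can leave its run before the loop ends, which is what makes the unguarded, unrolled versions possible.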

As for the high performance reported for Rust sorts, I suspect it's primarily due to Rust compiling ? : ternary operations as branchless code. This makes the benchmarks quite misleading, since gcc has no equivalent behavior.

When comparing crumsort compiled with clang to pdqsort compiled with g++, pdqsort is nearly two times slower than crumsort for 10000 elements.

scandum · 2024-02-10T12:44:20Z

I've managed to achieve some gains with unrolled parity merges for clang, though it makes performance slightly worse for gcc.

I also updated a variant of fluxsort with a bottom-up analyzer. It's a bit lackluster, as it's not well optimized to deal with odd inputs, and the 4-way top-down analyzer synergizes well with quadsort. I placed the code (skipsort) in the wolfsort repo.

Areas where gains should still be possible are improvements to galloping merges, and memory reductions.

What might also be of interest is Logsort (https://github.com/aphitorite/Logsort). It claims to be a stable in-place O(n log n) sort, though my own benchmarks suggest it still exhibits O(n log² n) moves, as I couldn't get it to run faster than blitsort on my own machine.

mlochbaum · 2024-02-10T21:41:15Z

What are you using to do swaps in sorting networks? Of course if it branches it's not going to work. I use xor (something like d = (a^b) & -(a>b); a ^= d; b ^= d) and that seems not too much slower than cmov. Also worth trying __builtin_unpredictable. This was broken in clang but fixed last year. Annoying that C has such bad codegen, but I wouldn't say Rust doing it faster is "misleading".
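To make the xor approach concrete, here is a minimal sketch of it as a compare-and-swap for unsigned 32-bit keys, used in a 3-element network. The names cswap_xor and sort3_xor are hypothetical, and whether it beats cmov will depend on the compiler and target.

```c
#include <stdint.h>

/* Hypothetical helper: branchless compare-and-swap using an xor mask. */
static inline void cswap_xor(uint32_t *a, uint32_t *b)
{
    /* mask is all ones when *a > *b, and zero otherwise */
    uint32_t mask = -(uint32_t)(*a > *b);
    uint32_t d = (*a ^ *b) & mask;
    *a ^= d;   /* becomes the old *b when the mask is set */
    *b ^= d;   /* becomes the old *a when the mask is set */
}

/* 3-element sorting network: comparators (0,1), (1,2), (0,1). */
static void sort3_xor(uint32_t v[3])
{
    cswap_xor(&v[0], &v[1]);
    cswap_xor(&v[1], &v[2]);
    cswap_xor(&v[0], &v[1]);
}
```

The xor trick only works on keys that can be treated as raw bits, so it does not carry over to records compared through a comparison function.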

scandum · 2024-02-10T23:25:28Z

I've been using

x = array[0] > array[1]; swap = array[!x]; array[0] = array[x]; array[1] = swap;
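Wrapped in a function, the snippet above might look like the following sketch. The names cswap_adjacent and sort4_adjacent are hypothetical, and the 4-element odd-even transposition network is only there to exercise the swap. It is not taken from any of the sorts discussed here.

```c
#include <stddef.h>

/* Hypothetical wrapper around the index-based swap quoted above:
 * the comparison result is used as an index instead of a branch. */
static inline void cswap_adjacent(int *array)
{
    size_t x = array[0] > array[1];  /* 1 when the pair is out of order */
    int swap = array[!x];            /* value that belongs in slot 1 */
    array[0] = array[x];             /* value that belongs in slot 0 */
    array[1] = swap;
}

/* Odd-even transposition network for 4 elements (4 rounds),
 * built only from adjacent compare-and-swaps. */
static void sort4_adjacent(int v[4])
{
    for (int round = 0; round < 2; round++) {
        cswap_adjacent(v + 0);
        cswap_adjacent(v + 2);
        cswap_adjacent(v + 1);
    }
}
```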

I use a slightly slower but similar method for non-adjacent elements. I didn't have good results with xor in the past. I'll have another look at __builtin_unpredictable, but I assume it won't make gcc compile ? : branchlessly?

dzaima · 2024-02-11T01:13:04Z

__builtin_unpredictable is a clang-only thing; afaik there's no unpredictability hint on gcc (__builtin_expect_with_probability(x, 0, 0.5) feels like it could be it, but I haven't encountered it doing anything). But gcc and clang can often do ? : branchlessly themselves (maybe excluding cases where only one case depends on a load). Here's some compiler explorer.
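As a rough illustration of the ternary form with the clang hint applied where available, here is a sketch. The macro SKETCH_UNPREDICTABLE and the function cswap_ternary are hypothetical names, and both loads are done up front so that neither arm of the ternary depends on a load. Whether a given compiler version actually emits cmov for this is not guaranteed and is worth checking in the generated assembly.

```c
/* Use __builtin_unpredictable only where the compiler reports it
 * (clang); elsewhere the macro is a no-op. */
#if defined(__has_builtin)
#  if __has_builtin(__builtin_unpredictable)
#    define SKETCH_UNPREDICTABLE(x) __builtin_unpredictable(x)
#  endif
#endif
#ifndef SKETCH_UNPREDICTABLE
#  define SKETCH_UNPREDICTABLE(x) (x)
#endif

/* Hypothetical helper: compare-and-swap written with ? : selects. */
static inline void cswap_ternary(int *a, int *b)
{
    int x = *a, y = *b;   /* load both values before selecting */
    int gt = x > y;
    *a = SKETCH_UNPREDICTABLE(gt) ? y : x;   /* smaller value */
    *b = SKETCH_UNPREDICTABLE(gt) ? x : y;   /* larger value */
}
```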

scandum · 2024-02-11T08:58:04Z

That approach won't be great for string comparisons, and mileage will vary depending on the gcc version.

I'm also wary of ending up focusing on micro-optimizations devoid of actual algorithmic improvements, though sadly it's probably 95% of what I do.
