Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sorting networks #1

Open
scandum opened this issue Feb 2, 2024 · 5 comments
Open

sorting networks #1

scandum opened this issue Feb 2, 2024 · 5 comments

Comments

@scandum
Copy link

scandum commented Feb 2, 2024

I haven't done much work on sorting lately, but figured to share some findings.

I looked into unstable sorting networks this week and haven't been able to reproduce the suggested performance gain. I suspect there's some cache pollution due to the large instruction size when utilizing sorting networks in a quicksort.

So far my best results have been using piposort on a threshold of 96, with unrolled 4, 8, 16 element parity merges and twice-unguarded insertion to fill the gaps.

As for the high performance reported by rust sorts, I suspect it's primarily due to rust compiling ? : ternary operations as branchless. This makes the benchmarks quite misleading, since there's no such thing in gcc.

When comparing crumsort compiled with clang to pdqsort compiled with g++, pdqsort is nearly two times slower than crumsort for 10000 elements.

@scandum
Copy link
Author

scandum commented Feb 10, 2024

I've managed to achieve some gains with unrolled parity merges for clang, though it makes performance slightly worse for gcc.

I also updated a variant of fluxsort with a bottom up analyzer, it's a bit lackluster as it's not well optimized to deal with odd inputs, and the 4-way top down analyzer synergizes well with quadsort. I placed the code (skipsort) in the wolfsort repo.

Areas where gains should still be possible are improvements to galloping merges, and memory reductions.

What might also be of interest is Logsort (https://github.com/aphitorite/Logsort). It claims to be n log n stable inplace, though my own benchmarks suggest it still exhibits n log² n moves, as I couldn't get it to run faster than blitsort on my own machine.

@mlochbaum
Copy link
Owner

What are you using to do swaps in sorting networks? Of course if it branches it's not going to work. I use xor (something like d = (a^b) &- (a>b); a^=d; b^=d) and that seems not too much slower than cmov. Also worth trying __builtin_unpredictable. This was broken in clang but fixed last year. Annoying that C has such bad codegen but I wouldn't say Rust doing it faster is "misleading".

@scandum
Copy link
Author

scandum commented Feb 10, 2024

I've been using

x = array[0] > array[1]; swap = array[!x]; array[0] = array[x]; array[1] = swap;

I use a slightly slower but similar method for non-adjacent. I didn't have good results with xor in the past. I'll have another look at __builtin_unpredictable, but I assume it won't make gcc compile ? : branchless?

@dzaima
Copy link

dzaima commented Feb 11, 2024

__builtin_unpredictable is a clang-only thing; afaik there's no unpredictability hint on gcc (__builtin_expect_with_probability(x, 0, 0.5) feels like could be it but I haven't encountered it doing anything). But gcc and clang can often do ? : branchlessly themselves (maybe excluding cases where only one case depends on a load). Here's some compiler explorer.

@scandum
Copy link
Author

scandum commented Feb 11, 2024

That approach won't be great for string comparisons, and mileage will vary depending on the gcc version.

I'm also wary of ending up focusing on micro-optimizations void of actual algorithmic improvements, though sadly it's probably 95% of what I do.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants