-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sorting networks #1
Comments
I've managed to achieve some gains with unrolled parity merges for clang, though it makes performance slightly worse for gcc. I also updated a variant of fluxsort with a bottom up analyzer, it's a bit lackluster as it's not well optimized to deal with odd inputs, and the 4-way top down analyzer synergizes well with quadsort. I placed the code (skipsort) in the wolfsort repo. Areas where gains should still be possible are improvements to galloping merges, and memory reductions. What might also be of interest is Logsort (https://github.com/aphitorite/Logsort). It claims to be n log n stable inplace, though my own benchmarks suggest it still exhibits n log² n moves, as I couldn't get it to run faster than blitsort on my own machine. |
What are you using to do swaps in sorting networks? Of course if it branches it's not going to work. I use xor (something like |
I've been using
I use a slightly slower but similar method for non-adjacent. I didn't have good results with xor in the past. I'll have another look at |
|
That approach won't be great for string comparisons, and mileage will vary depending on the gcc version. I'm also wary of ending up focusing on micro-optimizations void of actual algorithmic improvements, though sadly it's probably 95% of what I do. |
I haven't done much work on sorting lately, but figured to share some findings.
I looked into unstable sorting networks this week and haven't been able to reproduce the suggested performance gain. I suspect there's some cache pollution due to the large instruction size when utilizing sorting networks in a quicksort.
So far my best results have been using piposort on a threshold of 96, with unrolled 4, 8, 16 element parity merges and twice-unguarded insertion to fill the gaps.
As for the high performance reported by rust sorts, I suspect it's primarily due to rust compiling
? :
ternary operations as branchless. This makes the benchmarks quite misleading, since there's no such thing in gcc.When comparing crumsort compiled with clang to pdqsort compiled with g++, pdqsort is nearly two times slower than crumsort for 10000 elements.
The text was updated successfully, but these errors were encountered: