-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make sortperm faster for large vectors #33
Conversation
Using a custom PermVector which ensures that both indices and values are sorted at the same time ensures cache locality.
This approach is faster than PermVector.
Do we have test cases for non-random inputs? It perhaps begs the question what other cases should one test on. |
trait-based off e.g. |
Yes, clearly we need more benchmarks, especially if we have to identify a size threshold.
Sure, taking two different methods based on a threshold is always possible. But it's not great in terms of code simplicity. I should have noted that somehow data.table's |
it's not great in terms of code simplicity (agreed) My note presupposed that this is a widely used and for certain classes of application, unusually performance-sensitive function. The estimated algorithmic split length appears sufficiently high that a sizeable count of uses of the function would incur that introduced slowness. Simultaneously, the import of the introduced fastness for very large vectors is similarly significant. I am not aware of an internally adaptive approach that is relatively lightweight. There certainly may be .. |
For reference, I have implemented faster The performance difference is stark for |
Thanks. I've just tried that, and for julia> x = rand(-1000_000:1000_000, 10_000);
# Before
julia> @btime sortperm(x, alg=RadixSort);
395.240 μs (19 allocations: 350.00 KiB)
# Second commit in this PR
julia> @btime sortperm(x, alg=RadixSort);
403.349 μs (23 allocations: 506.41 KiB)
julia> @btime fsortperm(x);
953.816 μs (19 allocations: 2.31 MiB)
julia> x = rand(-1000_000:1000_000, 100_000);
# Before
julia> @btime sortperm(x, alg=RadixSort);
4.356 ms (19 allocations: 1.72 MiB)
# Second commit in this PR
julia> @btime sortperm(x, alg=RadixSort);
4.607 ms (23 allocations: 3.24 MiB)
julia> @btime fsortperm(x);
4.982 ms (19 allocations: 5.05 MiB)
julia> x = rand(-1000_000:1000_000, 1000_000);
# Before
julia> @btime sortperm(x, alg=RadixSort);
272.243 ms (19 allocations: 15.45 MiB)
# Second commit in this PR
julia> @btime sortperm(x, alg=RadixSort);
54.644 ms (23 allocations: 30.71 MiB)
julia> @btime fsortperm(x);
56.045 ms (19 allocations: 32.52 MiB)
julia> x = rand(-10_000_000:10_000_000, 10_000_000);
# Before
julia> @btime sortperm(x, alg=RadixSort);
7.812 s (19 allocations: 152.78 MiB)
# Second commit in this PR
julia> @btime sortperm(x, alg=RadixSort);
735.583 ms (23 allocations: 305.37 MiB)
julia> @btime fsortperm(x);
682.264 ms (19 allocations: 307.18 MiB)
(BTW, I had to use Can you explain what's the approach taken by |
Oh, there is a bug? I might look at it closely later. But please submit a bug report if you find more bugs.
fsortperm(x) = begin
# not the actual implementation of `fsortperm`, just an illustration
x_and_index = collect(zip(x, 1:length(x)))
sort!(x_and_index, by = x->x[1]) # sort by x, but index gets shuffled around too
map(x->x[2], x_and_index) # return the index only
end Actually, the original use case, I had was I wanted to sort the column and carry an index as well, so that after sorting it, I have the sorted column and a sortperm index. So I can rearrange DataFrames more efficiently this way. The real power of |
Yes I think you need to handle overflow (see how
OK, so that's essentially what this PR does, right? |
Sounds like it |
I'd like to revive this so that we can get fast radix sort for grouping in DataFrames. Maybe we could even use radix sort in Base for some types, like R does for some time. I've added a commit so that below a size of 2MB the current approach is still used, and we switch to the next approach only above that. That way we get the best of both worlds according to my benchmarks (compare below with OP). I also checked other types, and the threshold works well, except for I've also reverted to the cleaner approach of introducing julia> using SortingAlgorithms, BenchmarkTools
julia> x = rand(Int, 10_000);
julia> @btime sort(x, alg=RadixSort);
401.476 μs (18 allocations: 349.98 KiB)
julia> @btime sortperm(x, alg=RadixSort);
466.175 μs (19 allocations: 350.00 KiB)
julia> x = rand(Int, 100_000);
julia> @btime sort(x, alg=RadixSort);
5.048 ms (18 allocations: 1.72 MiB)
julia> @btime sortperm(x, alg=RadixSort);
6.344 ms (19 allocations: 1.72 MiB)
julia> x = rand(Int, 1_000_000);
julia> @btime sort(x, alg=RadixSort);
56.672 ms (18 allocations: 15.45 MiB)
julia> @btime sortperm(x, alg=RadixSort);
95.381 ms (25 allocations: 38.34 MiB)
julia> x = rand(Int, 10_000_000);
julia> @btime sort(x, alg=RadixSort);
677.512 ms (18 allocations: 152.78 MiB)
julia> @btime sortperm(x, alg=RadixSort);
1.188 s (25 allocations: 381.66 MiB) (Note CI won't pass before #37 is merged.) |
thank you |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update and merge with Project.toml. Can get this merged?
@@ -42,6 +42,27 @@ end | |||
|
|||
## Radix sort | |||
|
|||
struct PermElement{T} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add some comment for what this is for?
On Julia master, thanks to JuliaLang/julia#44230, julia> using SortingAlgorithms, BenchmarkTools
## 10k entries
julia> x = rand(Int, 10_000);
julia> @btime sort(x);
177.246 μs (6 allocations: 164.56 KiB)
julia> @btime sort(x, alg=RadixSort);
178.927 μs (18 allocations: 349.89 KiB)
julia> @btime sortperm(x);
508.140 μs (3 allocations: 78.19 KiB)
julia> @btime sortperm(x, alg=RadixSort);
221.536 μs (19 allocations: 349.91 KiB)
## 100k entries
julia> x = rand(Int, 100_000);
julia> @btime sort(x);
1.934 ms (6 allocations: 1.53 MiB)
julia> @btime sort(x, alg=RadixSort);
1.828 ms (18 allocations: 1.71 MiB)
julia> @btime sortperm(x);
7.118 ms (3 allocations: 781.31 KiB)
julia> @btime sortperm(x, alg=RadixSort);
2.507 ms (19 allocations: 1.71 MiB)
## 1M entries
julia> x = rand(Int, 1_000_000);
julia> @btime sort(x);
26.452 ms (6 allocations: 15.27 MiB)
julia> @btime sort(x, alg=RadixSort);
23.970 ms (18 allocations: 15.45 MiB)
julia> @btime sortperm(x);
120.149 ms (3 allocations: 7.63 MiB)
julia> @btime sortperm(x, alg=RadixSort);
302.748 ms (19 allocations: 15.45 MiB)
## 10M entries
julia> x = rand(Int, 10_000_000);
julia> @btime sort(x);
326.074 ms (6 allocations: 152.60 MiB)
julia> @btime sort(x, alg=RadixSort);
320.109 ms (18 allocations: 152.78 MiB)
julia> @btime sortperm(x);
1.977 s (3 allocations: 76.29 MiB)
julia> @btime sortperm(x, alg=RadixSort);
7.079 s (19 allocations: 152.78 MiB) |
|
It has been noted at JuliaLang/julia#939 that
sortperm
is much slower thansort
and than some implementations in other languages. In particular, it is much slower thanforderv
from the data.table R package, even when using Radix sort as data.table does.In this PR I've done a few experiments to improve this. When vectors get very large (> 1M entries), the performance gap between
sort
(which is roughly as fast as data.table with radix sort) andsortperm
increases dramatically, probably due to cache locality issues: since the original data isn't sorted as the same time as indices, memory accesses are irregular. With some changes so that both the permutation index and the data are sorted at the same time, the ratio betweensort
andsortperm
timings can be reduced to about 2 (instead of exploding as currently). Unfortunately, this PR also slows downsortperm
for smaller vectors (below 100k entries):Note the two commits: the first one implements a cleaner and more generic approach based on a
PermVector
wrapper, which could easily be adapted to other algorithms. But is significantly slower than the second one (suggestions welcome).