Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make sortperm faster for large vectors #33

Closed
wants to merge 4 commits into from
Closed

Make sortperm faster for large vectors #33

wants to merge 4 commits into from

Conversation

nalimilan
Copy link
Contributor

@nalimilan nalimilan commented Nov 17, 2019

It has been noted at JuliaLang/julia#939 that sortperm is much slower than sort and than some implementations in other languages. In particular, it is much slower than forderv from the data.table R package, even when using Radix sort as data.table does.

In this PR I've done a few experiments to improve this. When vectors get very large (> 1M entries), the performance gap between sort (which is roughly as fast as data.table with radix sort) and sortperm increases dramatically, probably due to cache locality issues: since the original data isn't sorted as the same time as indices, memory accesses are irregular. With some changes so that both the permutation index and the data are sorted at the same time, the ratio between sort and sortperm timings can be reduced to about 2 (instead of exploding as currently). Unfortunately, this PR also slows down sortperm for smaller vectors (below 100k entries):

julia> using SortingAlgorithms, BenchmarkTools

## 10k entries
julia> x = rand(Int, 10_000);

julia> @btime sort(x, alg=RadixSort);
  385.303 μs (18 allocations: 349.98 KiB)

# Before
julia> @btime sortperm(x, alg=RadixSort);
  451.645 μs (19 allocations: 350.00 KiB)

# After commit 1
julia> @btime sortperm(x, alg=RadixSort);
  717.036 μs (23 allocations: 476.27 KiB)

# After commit 2
julia> @btime sortperm(x, alg=RadixSort);
  566.928 μs (23 allocations: 506.41 KiB)


## 100k entries
julia> x = rand(Int, 100_000);

julia> @btime sort(x, alg=RadixSort);
  5.135 ms (18 allocations: 1.72 MiB)

# Before
julia> @btime sortperm(x, alg=RadixSort);
  6.531 ms (19 allocations: 1.72 MiB)

# After commit 1
julia> @btime sortperm(x, alg=RadixSort);
  9.459 ms (23 allocations: 2.53 MiB)

# After commit 2
julia> @btime sortperm(x, alg=RadixSort);
  8.082 ms (23 allocations: 3.24 MiB)


## 1M entries
julia> x = rand(Int, 1_000_000);

julia> @btime sort(x, alg=RadixSort);
  56.841 ms (18 allocations: 15.45 MiB)

# Before
julia> @btime sortperm(x, alg=RadixSort);
  437.113 ms (19 allocations: 15.45 MiB)

# After commit 1
julia> @btime sortperm(x, alg=RadixSort);
  113.574 ms (23 allocations: 23.12 MiB)

# After commit 2
julia> @btime sortperm(x, alg=RadixSort);
  98.436 ms (23 allocations: 30.71 MiB)


## 10M entries
julia> x = rand(Int, 10_000_000);

julia> @btime sort(x, alg=RadixSort);
  723.494 ms (18 allocations: 152.78 MiB)

# Before
julia> @btime sortperm(x, alg=RadixSort);
  6.239 s (19 allocations: 152.78 MiB)

# After commit 1
julia> @btime sortperm(x, alg=RadixSort);
  1.335 s (23 allocations: 229.12 MiB)

# After commit 2
julia> @btime sortperm(x, alg=RadixSort);
  1.306 s (23 allocations: 305.37 MiB)

Note the two commits: the first one implements a cleaner and more generic approach based on a PermVector wrapper, which could easily be adapted to other algorithms. But is significantly slower than the second one (suggestions welcome).

Using a custom PermVector which ensures that both indices and values
are sorted at the same time ensures cache locality.
This approach is faster than PermVector.
@nalimilan nalimilan changed the title Experimentations to make sortperm faster RFC: Experimentations to make sortperm faster Nov 17, 2019
@ViralBShah
Copy link

ViralBShah commented Nov 17, 2019

Do we have test cases for non-random inputs? It perhaps begs the question what other cases should one test on.

@JeffreySarnoff
Copy link

Unfortunately, this PR also slows down sortperm for smaller vectors (below 100k entries)

sortperm(x, _etc_) = length(x) < 100_000 ? sortperm_smaller(x, _etc_) : sortperm_larger(x, _etc_)

trait-based off e.g. abstract type ShortVectorToSort{T} <: Vector{T} end if that helps

@nalimilan
Copy link
Contributor Author

Do we have test cases for non-random inputs? It perhaps begs the question what other cases should one test on.

Yes, clearly we need more benchmarks, especially if we have to identify a size threshold.

Unfortunately, this PR also slows down sortperm for smaller vectors (below 100k entries)

sortperm(x, _etc_) = length(x) < 100_000 ? sortperm_smaller(x, _etc_) : sortperm_larger(x, _etc_)

trait-based off e.g. abstract type ShortVectorToSort{T} <: Vector{T} end if that helps

Sure, taking two different methods based on a threshold is always possible. But it's not great in terms of code simplicity.

I should have noted that somehow data.table's forderv is still faster than sortperm even with this PR (by about 15% for the 1M case, and 30% for the 10M case).

@JeffreySarnoff
Copy link

it's not great in terms of code simplicity (agreed)

My note presupposed that this is a widely used and for certain classes of application, unusually performance-sensitive function. The estimated algorithmic split length appears sufficiently high that a sizeable count of uses of the function would incur that introduced slowness. Simultaneously, the import of the introduced fastness for very large vectors is similarly significant. I am not aware of an internally adaptive approach that is relatively lightweight. There certainly may be ..

@xiaodaigh
Copy link

For reference, I have implemented faster sortperm in the form of fsortperm in SortingLab.jl.

The performance difference is stark for Vector{Union{Missing, T}}, see https://discourse.julialang.org/t/ann-sortinglab-jl-0-2-2-5x-faster-sort-and-sort-perm-for-vector-union-missing-t/31704

@nalimilan
Copy link
Contributor Author

Thanks. I've just tried that, and for Vector{Int} it seems to be about as fast as this PR for very large inputs, but much slower than this PR (and therefore than the current state) for smaller inputs. Though they have the same performance if the number of unique values is small (100). Actually, even the current state is as fast as this PR and as fsortperm in that case, disregarding the vector length.

julia> x = rand(-1000_000:1000_000, 10_000);

# Before
julia> @btime sortperm(x, alg=RadixSort);
  395.240 μs (19 allocations: 350.00 KiB)

# Second commit in this PR
julia> @btime sortperm(x, alg=RadixSort);
  403.349 μs (23 allocations: 506.41 KiB)

julia> @btime fsortperm(x);
  953.816 μs (19 allocations: 2.31 MiB)

julia> x = rand(-1000_000:1000_000, 100_000);

# Before
julia> @btime sortperm(x, alg=RadixSort);
  4.356 ms (19 allocations: 1.72 MiB)

# Second commit in this PR
julia> @btime sortperm(x, alg=RadixSort);
  4.607 ms (23 allocations: 3.24 MiB)

julia> @btime fsortperm(x);
  4.982 ms (19 allocations: 5.05 MiB)

julia> x = rand(-1000_000:1000_000, 1000_000);

# Before
julia> @btime sortperm(x, alg=RadixSort);
  272.243 ms (19 allocations: 15.45 MiB)

# Second commit in this PR
julia> @btime sortperm(x, alg=RadixSort);
  54.644 ms (23 allocations: 30.71 MiB)

julia> @btime fsortperm(x);
  56.045 ms (19 allocations: 32.52 MiB)

julia> x = rand(-10_000_000:10_000_000, 10_000_000);

# Before
julia> @btime sortperm(x, alg=RadixSort);
  7.812 s (19 allocations: 152.78 MiB)

# Second commit in this PR
julia> @btime sortperm(x, alg=RadixSort);
  735.583 ms (23 allocations: 305.37 MiB)

julia> @btime fsortperm(x);
  682.264 ms (19 allocations: 307.18 MiB)

(BTW, I had to use -1000_000:1000_000 for fsortperm to work since there's a bug due to overflow when checking the range with rand(Int, n).)

Can you explain what's the approach taken by fsortperm in a few words?

@xiaodaigh
Copy link

BTW, I had to use -1000_000:1000_000 for fsortperm to work since there's a bug due to overflow when checking the range with rand(Int, n)

Oh, there is a bug? I might look at it closely later. But please submit a bug report if you find more bugs.

Can you explain what's the approach taken by fsortperm in a few words?
For most types, fsortperm works by modifying the radix algorithm so that it sorts the original vector but also rearranges an index at the same time. In effect, it's a more efficient version of this

fsortperm(x) = begin
  # not the actual implementation of `fsortperm`, just an illustration
  x_and_index = collect(zip(x, 1:length(x)))
  sort!(x_and_index, by = x->x[1]) # sort by x, but index gets shuffled around too
  map(x->x[2], x_and_index) # return the index only
end

Actually, the original use case, I had was I wanted to sort the column and carry an index as well, so that after sorting it, I have the sorted column and a sortperm index. So I can rearrange DataFrames more efficiently this way.

The real power of fsortperm comes from being able to sort Vector{Union{T, Missing}} as well

@nalimilan
Copy link
Contributor Author

Oh, there is a bug? I might look at it closely later. But please submit a bug report if you find more bugs.

Yes I think you need to handle overflow (see how Base.sort does this).

For most types, fsortperm works by modifying the radix algorithm so that it sorts the original vector but also rearranges an index at the same time. In effect, it's a more efficient version of this

OK, so that's essentially what this PR does, right?

@xiaodaigh
Copy link

Oh, there is a bug? I might look at it closely later. But please submit a bug report if you find more bugs.

Yes I think you need to handle overflow (see how Base.sort does this).

For most types, fsortperm works by modifying the radix algorithm so that it sorts the original vector but also rearranges an index at the same time. In effect, it's a more efficient version of this

OK, so that's essentially what this PR does, right?

Sounds like it

@nalimilan
Copy link
Contributor Author

I'd like to revive this so that we can get fast radix sort for grouping in DataFrames. Maybe we could even use radix sort in Base for some types, like R does for some time.

I've added a commit so that below a size of 2MB the current approach is still used, and we switch to the next approach only above that. That way we get the best of both worlds according to my benchmarks (compare below with OP). I also checked other types, and the threshold works well, except for Int8 and UInt8 where there's no gain even for a 10GB vector (this is probably because a single pass is needed for this type). So I excluded these types. I've also tested various patterns, and I see either the same speedup (reverse sorted input) or no difference (sorted input). (When all values are in a small range, Base calls sortperm_int_range directly so our method isn't used at all.)

I've also reverted to the cleaner approach of introducing PermVector, as it's as fast as the more hacky solution for large vectors.

julia> using SortingAlgorithms, BenchmarkTools

julia> x = rand(Int, 10_000);

julia> @btime sort(x, alg=RadixSort);

  401.476 μs (18 allocations: 349.98 KiB)

julia> @btime sortperm(x, alg=RadixSort);
  466.175 μs (19 allocations: 350.00 KiB)

julia> x = rand(Int, 100_000);

julia> @btime sort(x, alg=RadixSort);
  5.048 ms (18 allocations: 1.72 MiB)

julia> @btime sortperm(x, alg=RadixSort);
  6.344 ms (19 allocations: 1.72 MiB)

julia> x = rand(Int, 1_000_000);

julia> @btime sort(x, alg=RadixSort);
  56.672 ms (18 allocations: 15.45 MiB)

julia> @btime sortperm(x, alg=RadixSort);
  95.381 ms (25 allocations: 38.34 MiB)

julia> x = rand(Int, 10_000_000);

julia> @btime sort(x, alg=RadixSort);
  677.512 ms (18 allocations: 152.78 MiB)

julia> @btime sortperm(x, alg=RadixSort);
  1.188 s (25 allocations: 381.66 MiB)

(Note CI won't pass before #37 is merged.)

@nalimilan nalimilan changed the title RFC: Experimentations to make sortperm faster Make sortperm faster for large vectors Sep 28, 2020
@JeffreySarnoff
Copy link

thank you

Copy link

@xiaodaigh xiaodaigh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update and merge with Project.toml. Can get this merged?

@@ -42,6 +42,27 @@ end

## Radix sort

struct PermElement{T}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add some comment for what this is for?

@nalimilan
Copy link
Contributor Author

On Julia master, thanks to JuliaLang/julia#44230, sort with the default algorithm is as fast as RadixSort from SortingAlgorithms. sortperm with the default algorithm is roughly as fast as RadixSort on this PR (and much faster than RadixSort on SortingAlgotihms master) for very large vectors; but for vectors below 1 million entries, the default algorithm is significantly slower than RadixSort. @LilithHafner Any idea whether this could be improved?

julia> using SortingAlgorithms, BenchmarkTools

## 10k entries
julia> x = rand(Int, 10_000);

julia> @btime sort(x);
  177.246 μs (6 allocations: 164.56 KiB)

julia> @btime sort(x, alg=RadixSort);
  178.927 μs (18 allocations: 349.89 KiB)

julia> @btime sortperm(x);
  508.140 μs (3 allocations: 78.19 KiB)

julia> @btime sortperm(x, alg=RadixSort);
  221.536 μs (19 allocations: 349.91 KiB)


## 100k entries
julia> x = rand(Int, 100_000);

julia> @btime sort(x);
  1.934 ms (6 allocations: 1.53 MiB)

julia> @btime sort(x, alg=RadixSort);
  1.828 ms (18 allocations: 1.71 MiB)

julia> @btime sortperm(x);
  7.118 ms (3 allocations: 781.31 KiB)

julia> @btime sortperm(x, alg=RadixSort);
  2.507 ms (19 allocations: 1.71 MiB)


## 1M entries
julia> x = rand(Int, 1_000_000);

julia> @btime sort(x);
  26.452 ms (6 allocations: 15.27 MiB)

julia> @btime sort(x, alg=RadixSort);
  23.970 ms (18 allocations: 15.45 MiB)

julia> @btime sortperm(x);
  120.149 ms (3 allocations: 7.63 MiB)

julia> @btime sortperm(x, alg=RadixSort);
  302.748 ms (19 allocations: 15.45 MiB)


## 10M entries
julia> x = rand(Int, 10_000_000);

julia> @btime sort(x);
  326.074 ms (6 allocations: 152.60 MiB)

julia> @btime sort(x, alg=RadixSort);
  320.109 ms (18 allocations: 152.78 MiB)

julia> @btime sortperm(x);
  1.977 s (3 allocations: 76.29 MiB)

julia> @btime sortperm(x, alg=RadixSort);
  7.079 s (19 allocations: 152.78 MiB)

@LilithHafner
Copy link
Member

sortperm(x) is slower than sortperm(x, alg=RadixSort) because sortperm(x) does not use radix sort. It doesn't use radix sort because it utilizes sortperm!(inds, x) which guarantees stability even if inds is initialized out of order, and radix sort cannot efficiently meet that guarantee. I plan to fix this soon.

@LilithHafner LilithHafner deleted the branch JuliaCollections:master October 22, 2023 15:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants