Make sortperm faster for large vectors #33

nalimilan · 2019-11-17T17:04:46Z

It has been noted at JuliaLang/julia#939 that sortperm is much slower than sort and than some implementations in other languages. In particular, it is much slower than forderv from the data.table R package, even when using Radix sort as data.table does.

In this PR I've done a few experiments to improve this. When vectors get very large (> 1M entries), the performance gap between sort (which is roughly as fast as data.table with radix sort) and sortperm increases dramatically, probably due to cache locality issues: since the original data isn't sorted as the same time as indices, memory accesses are irregular. With some changes so that both the permutation index and the data are sorted at the same time, the ratio between sort and sortperm timings can be reduced to about 2 (instead of exploding as currently). Unfortunately, this PR also slows down sortperm for smaller vectors (below 100k entries):

julia> using SortingAlgorithms, BenchmarkTools

## 10k entries
julia> x = rand(Int, 10_000);

julia> @btime sort(x, alg=RadixSort);
  385.303 μs (18 allocations: 349.98 KiB)

# Before
julia> @btime sortperm(x, alg=RadixSort);
  451.645 μs (19 allocations: 350.00 KiB)

# After commit 1
julia> @btime sortperm(x, alg=RadixSort);
  717.036 μs (23 allocations: 476.27 KiB)

# After commit 2
julia> @btime sortperm(x, alg=RadixSort);
  566.928 μs (23 allocations: 506.41 KiB)


## 100k entries
julia> x = rand(Int, 100_000);

julia> @btime sort(x, alg=RadixSort);
  5.135 ms (18 allocations: 1.72 MiB)

# Before
julia> @btime sortperm(x, alg=RadixSort);
  6.531 ms (19 allocations: 1.72 MiB)

# After commit 1
julia> @btime sortperm(x, alg=RadixSort);
  9.459 ms (23 allocations: 2.53 MiB)

# After commit 2
julia> @btime sortperm(x, alg=RadixSort);
  8.082 ms (23 allocations: 3.24 MiB)


## 1M entries
julia> x = rand(Int, 1_000_000);

julia> @btime sort(x, alg=RadixSort);
  56.841 ms (18 allocations: 15.45 MiB)

# Before
julia> @btime sortperm(x, alg=RadixSort);
  437.113 ms (19 allocations: 15.45 MiB)

# After commit 1
julia> @btime sortperm(x, alg=RadixSort);
  113.574 ms (23 allocations: 23.12 MiB)

# After commit 2
julia> @btime sortperm(x, alg=RadixSort);
  98.436 ms (23 allocations: 30.71 MiB)


## 10M entries
julia> x = rand(Int, 10_000_000);

julia> @btime sort(x, alg=RadixSort);
  723.494 ms (18 allocations: 152.78 MiB)

# Before
julia> @btime sortperm(x, alg=RadixSort);
  6.239 s (19 allocations: 152.78 MiB)

# After commit 1
julia> @btime sortperm(x, alg=RadixSort);
  1.335 s (23 allocations: 229.12 MiB)

# After commit 2
julia> @btime sortperm(x, alg=RadixSort);
  1.306 s (23 allocations: 305.37 MiB)

Note the two commits: the first one implements a cleaner and more generic approach based on a PermVector wrapper, which could easily be adapted to other algorithms. But is significantly slower than the second one (suggestions welcome).

Using a custom PermVector which ensures that both indices and values are sorted at the same time ensures cache locality.

This approach is faster than PermVector.

ViralBShah · 2019-11-17T17:15:06Z

Do we have test cases for non-random inputs? It perhaps begs the question what other cases should one test on.

JeffreySarnoff · 2019-11-17T18:11:58Z

Unfortunately, this PR also slows down sortperm for smaller vectors (below 100k entries)

sortperm(x, _etc_) = length(x) < 100_000 ? sortperm_smaller(x, _etc_) : sortperm_larger(x, _etc_)

trait-based off e.g. abstract type ShortVectorToSort{T} <: Vector{T} end if that helps

nalimilan · 2019-11-17T20:47:20Z

Do we have test cases for non-random inputs? It perhaps begs the question what other cases should one test on.

Yes, clearly we need more benchmarks, especially if we have to identify a size threshold.

Unfortunately, this PR also slows down sortperm for smaller vectors (below 100k entries)
sortperm(x, _etc_) = length(x) < 100_000 ? sortperm_smaller(x, _etc_) : sortperm_larger(x, _etc_)
trait-based off e.g. abstract type ShortVectorToSort{T} <: Vector{T} end if that helps

Sure, taking two different methods based on a threshold is always possible. But it's not great in terms of code simplicity.

I should have noted that somehow data.table's forderv is still faster than sortperm even with this PR (by about 15% for the 1M case, and 30% for the 10M case).

JeffreySarnoff · 2019-11-17T21:33:15Z

it's not great in terms of code simplicity (agreed)

My note presupposed that this is a widely used and for certain classes of application, unusually performance-sensitive function. The estimated algorithmic split length appears sufficiently high that a sizeable count of uses of the function would incur that introduced slowness. Simultaneously, the import of the introduced fastness for very large vectors is similarly significant. I am not aware of an internally adaptive approach that is relatively lightweight. There certainly may be ..

xiaodaigh · 2019-12-01T00:20:06Z

For reference, I have implemented faster sortperm in the form of fsortperm in SortingLab.jl.

The performance difference is stark for Vector{Union{Missing, T}}, see https://discourse.julialang.org/t/ann-sortinglab-jl-0-2-2-5x-faster-sort-and-sort-perm-for-vector-union-missing-t/31704

nalimilan · 2019-12-01T10:48:28Z

Thanks. I've just tried that, and for Vector{Int} it seems to be about as fast as this PR for very large inputs, but much slower than this PR (and therefore than the current state) for smaller inputs. Though they have the same performance if the number of unique values is small (100). Actually, even the current state is as fast as this PR and as fsortperm in that case, disregarding the vector length.

julia> x = rand(-1000_000:1000_000, 10_000);

# Before
julia> @btime sortperm(x, alg=RadixSort);
  395.240 μs (19 allocations: 350.00 KiB)

# Second commit in this PR
julia> @btime sortperm(x, alg=RadixSort);
  403.349 μs (23 allocations: 506.41 KiB)

julia> @btime fsortperm(x);
  953.816 μs (19 allocations: 2.31 MiB)

julia> x = rand(-1000_000:1000_000, 100_000);

# Before
julia> @btime sortperm(x, alg=RadixSort);
  4.356 ms (19 allocations: 1.72 MiB)

# Second commit in this PR
julia> @btime sortperm(x, alg=RadixSort);
  4.607 ms (23 allocations: 3.24 MiB)

julia> @btime fsortperm(x);
  4.982 ms (19 allocations: 5.05 MiB)

julia> x = rand(-1000_000:1000_000, 1000_000);

# Before
julia> @btime sortperm(x, alg=RadixSort);
  272.243 ms (19 allocations: 15.45 MiB)

# Second commit in this PR
julia> @btime sortperm(x, alg=RadixSort);
  54.644 ms (23 allocations: 30.71 MiB)

julia> @btime fsortperm(x);
  56.045 ms (19 allocations: 32.52 MiB)

julia> x = rand(-10_000_000:10_000_000, 10_000_000);

# Before
julia> @btime sortperm(x, alg=RadixSort);
  7.812 s (19 allocations: 152.78 MiB)

# Second commit in this PR
julia> @btime sortperm(x, alg=RadixSort);
  735.583 ms (23 allocations: 305.37 MiB)

julia> @btime fsortperm(x);
  682.264 ms (19 allocations: 307.18 MiB)

(BTW, I had to use -1000_000:1000_000 for fsortperm to work since there's a bug due to overflow when checking the range with rand(Int, n).)

Can you explain what's the approach taken by fsortperm in a few words?

xiaodaigh · 2019-12-01T11:14:51Z

BTW, I had to use -1000_000:1000_000 for fsortperm to work since there's a bug due to overflow when checking the range with rand(Int, n)

Oh, there is a bug? I might look at it closely later. But please submit a bug report if you find more bugs.

Can you explain what's the approach taken by fsortperm in a few words?
For most types, fsortperm works by modifying the radix algorithm so that it sorts the original vector but also rearranges an index at the same time. In effect, it's a more efficient version of this

fsortperm(x) = begin
  # not the actual implementation of `fsortperm`, just an illustration
  x_and_index = collect(zip(x, 1:length(x)))
  sort!(x_and_index, by = x->x[1]) # sort by x, but index gets shuffled around too
  map(x->x[2], x_and_index) # return the index only
end

Actually, the original use case, I had was I wanted to sort the column and carry an index as well, so that after sorting it, I have the sorted column and a sortperm index. So I can rearrange DataFrames more efficiently this way.

The real power of fsortperm comes from being able to sort Vector{Union{T, Missing}} as well

nalimilan · 2019-12-01T11:42:14Z

Oh, there is a bug? I might look at it closely later. But please submit a bug report if you find more bugs.

Yes I think you need to handle overflow (see how Base.sort does this).

For most types, fsortperm works by modifying the radix algorithm so that it sorts the original vector but also rearranges an index at the same time. In effect, it's a more efficient version of this

OK, so that's essentially what this PR does, right?

xiaodaigh · 2019-12-01T13:35:51Z

Oh, there is a bug? I might look at it closely later. But please submit a bug report if you find more bugs.

Yes I think you need to handle overflow (see how Base.sort does this).

For most types, fsortperm works by modifying the radix algorithm so that it sorts the original vector but also rearranges an index at the same time. In effect, it's a more efficient version of this

OK, so that's essentially what this PR does, right?

Sounds like it

This reverts commit 2de8220.

nalimilan · 2020-09-28T12:02:46Z

I'd like to revive this so that we can get fast radix sort for grouping in DataFrames. Maybe we could even use radix sort in Base for some types, like R does for some time.

I've added a commit so that below a size of 2MB the current approach is still used, and we switch to the next approach only above that. That way we get the best of both worlds according to my benchmarks (compare below with OP). I also checked other types, and the threshold works well, except for Int8 and UInt8 where there's no gain even for a 10GB vector (this is probably because a single pass is needed for this type). So I excluded these types. I've also tested various patterns, and I see either the same speedup (reverse sorted input) or no difference (sorted input). (When all values are in a small range, Base calls sortperm_int_range directly so our method isn't used at all.)

I've also reverted to the cleaner approach of introducing PermVector, as it's as fast as the more hacky solution for large vectors.

julia> using SortingAlgorithms, BenchmarkTools

julia> x = rand(Int, 10_000);

julia> @btime sort(x, alg=RadixSort);

  401.476 μs (18 allocations: 349.98 KiB)

julia> @btime sortperm(x, alg=RadixSort);
  466.175 μs (19 allocations: 350.00 KiB)

julia> x = rand(Int, 100_000);

julia> @btime sort(x, alg=RadixSort);
  5.048 ms (18 allocations: 1.72 MiB)

julia> @btime sortperm(x, alg=RadixSort);
  6.344 ms (19 allocations: 1.72 MiB)

julia> x = rand(Int, 1_000_000);

julia> @btime sort(x, alg=RadixSort);
  56.672 ms (18 allocations: 15.45 MiB)

julia> @btime sortperm(x, alg=RadixSort);
  95.381 ms (25 allocations: 38.34 MiB)

julia> x = rand(Int, 10_000_000);

julia> @btime sort(x, alg=RadixSort);
  677.512 ms (18 allocations: 152.78 MiB)

julia> @btime sortperm(x, alg=RadixSort);
  1.188 s (25 allocations: 381.66 MiB)

(Note CI won't pass before #37 is merged.)

JeffreySarnoff · 2020-09-28T12:54:54Z

thank you

xiaodaigh

update and merge with Project.toml. Can get this merged?

xiaodaigh · 2020-10-02T15:32:18Z

src/SortingAlgorithms.jl

@@ -42,6 +42,27 @@ end

 ## Radix sort

+struct PermElement{T}


add some comment for what this is for?

nalimilan · 2022-04-03T16:42:35Z

On Julia master, thanks to JuliaLang/julia#44230, sort with the default algorithm is as fast as RadixSort from SortingAlgorithms. sortperm with the default algorithm is roughly as fast as RadixSort on this PR (and much faster than RadixSort on SortingAlgotihms master) for very large vectors; but for vectors below 1 million entries, the default algorithm is significantly slower than RadixSort. @LilithHafner Any idea whether this could be improved?

julia> using SortingAlgorithms, BenchmarkTools

## 10k entries
julia> x = rand(Int, 10_000);

julia> @btime sort(x);
  177.246 μs (6 allocations: 164.56 KiB)

julia> @btime sort(x, alg=RadixSort);
  178.927 μs (18 allocations: 349.89 KiB)

julia> @btime sortperm(x);
  508.140 μs (3 allocations: 78.19 KiB)

julia> @btime sortperm(x, alg=RadixSort);
  221.536 μs (19 allocations: 349.91 KiB)


## 100k entries
julia> x = rand(Int, 100_000);

julia> @btime sort(x);
  1.934 ms (6 allocations: 1.53 MiB)

julia> @btime sort(x, alg=RadixSort);
  1.828 ms (18 allocations: 1.71 MiB)

julia> @btime sortperm(x);
  7.118 ms (3 allocations: 781.31 KiB)

julia> @btime sortperm(x, alg=RadixSort);
  2.507 ms (19 allocations: 1.71 MiB)


## 1M entries
julia> x = rand(Int, 1_000_000);

julia> @btime sort(x);
  26.452 ms (6 allocations: 15.27 MiB)

julia> @btime sort(x, alg=RadixSort);
  23.970 ms (18 allocations: 15.45 MiB)

julia> @btime sortperm(x);
  120.149 ms (3 allocations: 7.63 MiB)

julia> @btime sortperm(x, alg=RadixSort);
  302.748 ms (19 allocations: 15.45 MiB)


## 10M entries
julia> x = rand(Int, 10_000_000);

julia> @btime sort(x);
  326.074 ms (6 allocations: 152.60 MiB)

julia> @btime sort(x, alg=RadixSort);
  320.109 ms (18 allocations: 152.78 MiB)

julia> @btime sortperm(x);
  1.977 s (3 allocations: 76.29 MiB)

julia> @btime sortperm(x, alg=RadixSort);
  7.079 s (19 allocations: 152.78 MiB)

LilithHafner · 2022-04-04T04:14:17Z

sortperm(x) is slower than sortperm(x, alg=RadixSort) because sortperm(x) does not use radix sort. It doesn't use radix sort because it utilizes sortperm!(inds, x) which guarantees stability even if inds is initialized out of order, and radix sort cannot efficiently meet that guarantee. I plan to fix this soon.

nalimilan added 2 commits November 17, 2019 17:04

Improve sortperm performance with RadixSort

960afbc

Using a custom PermVector which ensures that both indices and values are sorted at the same time ensures cache locality.

Add sortperm-specific instructions inside sort!

2de8220

This approach is faster than PermVector.

nalimilan changed the title ~~Experimentations to make sortperm faster~~ RFC: Experimentations to make sortperm faster Nov 17, 2019

nalimilan mentioned this pull request Nov 17, 2019

sortperm has poor performance JuliaLang/julia#939

Open

nalimilan mentioned this pull request Jul 9, 2020

added string radix sort and sortperm + tests #27

Closed

nalimilan mentioned this pull request Aug 11, 2020

Faster joins meta issue JuliaData/DataFrames.jl#2340

Closed

nalimilan added 2 commits September 27, 2020 23:19

Revert "Add sortperm-specific instructions inside sort!"

e5d6744

This reverts commit 2de8220.

Keep using current algorithm below 2MB

696201e

nalimilan changed the title ~~RFC: Experimentations to make sortperm faster~~ Make sortperm faster for large vectors Sep 28, 2020

nalimilan mentioned this pull request Oct 2, 2020

Improve sort performance JuliaData/DataFrames.jl#2455

Open

xiaodaigh suggested changes Oct 3, 2020

View reviewed changes

src/SortingAlgorithms.jl

@@ -42,6 +42,27 @@ end

## Radix sort

struct PermElement{T}

Copy link

xiaodaigh Oct 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add some comment for what this is for?

nalimilan mentioned this pull request Oct 17, 2021

radix sort fallback to Base on small input #51

Closed

henry2004y mentioned this pull request Dec 17, 2022

sortperm algorithm henry2004y/Vlasiator.jl#124

Closed

LilithHafner deleted the branch JuliaCollections:master October 22, 2023 15:01

LilithHafner closed this Oct 22, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make sortperm faster for large vectors #33

Make sortperm faster for large vectors #33

nalimilan commented Nov 17, 2019 •

edited

Loading

ViralBShah commented Nov 17, 2019 •

edited

Loading

JeffreySarnoff commented Nov 17, 2019

nalimilan commented Nov 17, 2019

JeffreySarnoff commented Nov 17, 2019

xiaodaigh commented Dec 1, 2019

nalimilan commented Dec 1, 2019

xiaodaigh commented Dec 1, 2019

nalimilan commented Dec 1, 2019

xiaodaigh commented Dec 1, 2019

nalimilan commented Sep 28, 2020

JeffreySarnoff commented Sep 28, 2020

xiaodaigh left a comment

xiaodaigh Oct 2, 2020

nalimilan commented Apr 3, 2022

LilithHafner commented Apr 4, 2022

Make sortperm faster for large vectors #33

Make sortperm faster for large vectors #33

Conversation

nalimilan commented Nov 17, 2019 • edited Loading

ViralBShah commented Nov 17, 2019 • edited Loading

JeffreySarnoff commented Nov 17, 2019

nalimilan commented Nov 17, 2019

JeffreySarnoff commented Nov 17, 2019

xiaodaigh commented Dec 1, 2019

nalimilan commented Dec 1, 2019

xiaodaigh commented Dec 1, 2019

nalimilan commented Dec 1, 2019

xiaodaigh commented Dec 1, 2019

nalimilan commented Sep 28, 2020

JeffreySarnoff commented Sep 28, 2020

xiaodaigh left a comment

Choose a reason for hiding this comment

xiaodaigh Oct 2, 2020

Choose a reason for hiding this comment

nalimilan commented Apr 3, 2022

LilithHafner commented Apr 4, 2022

nalimilan commented Nov 17, 2019 •

edited

Loading

ViralBShah commented Nov 17, 2019 •

edited

Loading