Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

radix sort fallback to Base on small input #51

Conversation

LilithHafner
Copy link
Member

@LilithHafner LilithHafner commented Sep 30, 2021

Fixes #50

And defer allocation of the temporary array until after the fallback decision point because we will typically fall back to QuickSort which does not need a temporary array.

I chose 2^RADIX_SIZE because we allocate an array of that size. I validated the choice empirically with these tests:

for n in [500, 1000, 2000, 4000]
    a = @belapsed sort(x) setup=(x=rand(Int, $n)) evals=1 seconds=1
    b = @belapsed sort(x; alg=RadixSort) setup=(x=rand(Int, $n)) evals=1 seconds=1
    println(n, ": ", a/b)
end
500: 0.4460956119444834
1000: 0.9001948035520023
2000: 1.337070228643473
4000: 1.7441533572359844
for n in [500, 1000, 2000, 4000]
    a = @belapsed sort(x) setup=(x=rand(Float, $n)) evals=1 seconds=1
    b = @belapsed sort(x; alg=RadixSort) setup=(x=rand(Float, $n)) evals=1 seconds=1
    println(n, ": ", a/b)
end
500: 0.38127995053331276
1000: 0.6718377570395507
2000: 1.112666746785242
4000: 1.4809084311791743

@LilithHafner LilithHafner changed the title radix sort fallback to Base on small input (#50) radix sort fallback to Base on small input (fixes #50) Sep 30, 2021
@LilithHafner LilithHafner changed the title radix sort fallback to Base on small input (fixes #50) radix sort fallback to Base on small input Sep 30, 2021
@codecov-commenter
Copy link

codecov-commenter commented Sep 30, 2021

Codecov Report

Merging #51 (c507f37) into master (d0a9007) will decrease coverage by 0.18%.
The diff coverage is 90.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #51      +/-   ##
==========================================
- Coverage   97.84%   97.66%   -0.19%     
==========================================
  Files           1        1              
  Lines         325      342      +17     
==========================================
+ Hits          318      334      +16     
- Misses          7        8       +1     
Impacted Files Coverage Δ
src/SortingAlgorithms.jl 97.66% <90.00%> (-0.19%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d0a9007...c507f37. Read the comment docs.

Copy link
Contributor

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

test/runtests.jl Outdated Show resolved Hide resolved
src/SortingAlgorithms.jl Outdated Show resolved Hide resolved
LilithHafner and others added 3 commits October 16, 2021 16:30
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
hi - lo < 2^RADIX_SIZE implies hi < lo unless we have an overflow issue, but on arrays that big we already fail b.c. bins is a UInt32 array.
function sort!(vs::AbstractVector, lo::Int, hi::Int, ::RadixSortAlg, o::Ordering, ts=similar(vs))
# Input checking
if lo >= hi; return vs; end
function sort!(vs::AbstractVector, lo::Int, hi::Int, ::RadixSortAlg, o::Ordering, ts=similar(vs, 0))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need ts as an argument here, right? I'm not aware of anywhere that passes their own ts, for example. I think we just move ts = similar(vs) down below the length check.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I use it in #33, but indeed I've no idea why it's supported here given that it's not used. Anyway better leave this for another PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah OK you mention this because of the addition of the length check. Indeed better move this inside the function then.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not aware of very many users of SortingAlgorithms, but I imagine one might desire to pass in ts when sorting multiple arrays to avoid allocations and save time. For example:

using SortingAlgorithms, BenchmarkTools
using Base.Sort, Base.Order
using Base.Sort: defalg
function sort_ts!(A::AbstractArray;
               dims::Integer,
               alg::Algorithm=defalg(A),
               lt=isless,
               by=identity,
               rev::Union{Bool,Nothing}=nothing,
               order::Ordering=Forward)
    ordr = ord(lt, by, rev, order)
    nd = ndims(A)
    k = dims

    1 <= k <= nd || throw(ArgumentError("dimension out of range"))

    remdims = ntuple(i -> i == k ? 1 : size(A, i), nd)
    ts = similar(A[ntuple(i -> i == k ? Colon() : 1, nd)...])
    for idx in CartesianIndices(remdims)
        Av = view(A, ntuple(i -> i == k ? Colon() : idx[i], nd)...)
        sort_ts!(Av, alg, ordr, ts)
    end
    A
end

function sort_ts!(v::AbstractVector, alg::Algorithm, order::Ordering, ts)
    inds = axes(v,1)
    sort!(v,first(inds),last(inds),alg,order,ts)
end

a = @benchmark sort!(x; dims=1) setup=(x=rand(Int, 5_000, 5_000))
display(a)
b = @benchmark sort!(x; alg=RadixSort, dims=1) setup=(x=rand(Int, 5_000, 5_000))
display(b)
c = @benchmark sort_ts!(x; alg=RadixSort, dims=1) setup=(x=rand(Int, 5_000, 5_000))
display(c)

With results

BenchmarkTools.Trial: 4 samples with 1 evaluation.
 Range (min … max):  1.159 s …  1.168 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.164 s             ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.164 s ± 3.638 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                             ██                       █  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  1.16 s        Histogram: frequency by time        1.17 s <

 Memory estimate: 468.75 KiB, allocs estimate: 10000.
BenchmarkTools.Trial: 4 samples with 1 evaluation.
 Range (min … max):  1.210 s …    1.429 s  ┊ GC (min … max): 15.85% … 23.46%
 Time  (median):     1.332 s               ┊ GC (median):    20.54%
 Time  (mean ± σ):   1.326 s ± 110.688 ms  ┊ GC (mean ± σ):  22.18% ±  6.99%

  █          █                                         █   █  
  █▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁█ ▁
  1.21 s         Histogram: frequency by time         1.43 s <

 Memory estimate: 1.11 GiB, allocs estimate: 90000.
BenchmarkTools.Trial: 4 samples with 1 evaluation.
 Range (min … max):  1.081 s …    1.286 s  ┊ GC (min … max): 14.12% … 26.52%
 Time  (median):     1.281 s               ┊ GC (median):    25.15%
 Time  (mean ± σ):   1.232 s ± 100.796 ms  ┊ GC (mean ± σ):  24.24% ±  7.18%

  ▁                                                       █▁  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁██ ▁
  1.08 s         Histogram: frequency by time         1.29 s <

 Memory estimate: 945.74 MiB, allocs estimate: 80004.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that it's not clear whether this is part of the public API or not. In doubt, maybe we should keep it to avoid breaking things.

But I don't think we should silently resize ts disregarding its size. The cleanest solution would probably to use nothing as the default. Otherwise, we could allow 0-length ts as a special case, but that's a bit less clean (and that's one more allocation).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the now-current edb805c version is as strict as before the PR: throws a bounds error if ts or vs fail to contain all of lo:hi and otherwise messes with the contents of ts but not its size.

I also think the current API is identical to the old: one may pass in a correctly sized ts or leave that argument off.

Theoretically, the only changes this PR makes are:

  1. when hi-lo is small radix sort is not run, and no allocations are made.
  2. when ts needs to be allocated (almost always) an OffsetVector spanning lo:hi is allocated instead of a full similar(vs)
  3. explicit low cost bounds checks at the beginning of the sorting algorithm that should cover all subsequent bounds errors (though no additional @inbounds are added)

Perhaps 2 and 3 belong in a different PR, but I think both are reasonable (assuming OffsetVectors are implemented as efficiently as Vectors)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the use of an OffsetArray is free:

julia> @benchmark sort!(x; alg=RadixSort) setup=(x=rand(20_000))
BenchmarkTools.Trial: 7600 samples with 1 evaluation.
 Range (min  max):  421.061 μs    5.636 ms  ┊ GC (min  max): 0.00%  89.58%
 Time  (median):     512.915 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   551.441 μs ± 340.750 μs  ┊ GC (mean ± σ):  4.52% ±  6.63%

     ▁▂▂▂▁▃▄▆██▇▅▃▂▂▂▁▁▁▁                         ▁             ▂
  ▅▆▆███████████████████████▇██▇▇▆▇▆▆▇▇▆▅▆▅▅▆▅▅▅▆▇█▇▇▄▃▃▄▄▂▄▂▄▅ █
  421 μs        Histogram: log(frequency) by time        871 μs <

 Memory estimate: 349.91 KiB, allocs estimate: 16.

julia> @benchmark sort!(x; alg=RadixSort) setup=(x=OffsetArray(rand(20_000), 1:20_000))
BenchmarkTools.Trial: 7511 samples with 1 evaluation.
 Range (min  max):  416.961 μs    6.288 ms  ┊ GC (min  max): 0.00%  90.21%
 Time  (median):     511.857 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   553.784 μs ± 364.445 μs  ┊ GC (mean ± σ):  4.80% ±  6.65%

     ▁▁▂▁▁   ▄▇██▇▆▅▄▃▂▂▂▂▁▁▁▁                                  ▂
  ▃▂▆██████▇█████████████████████▇███▇▆▆▆▇▇▆▅▆▅▆▆▅▅▆▅▅▄▆▆██▇▇▆▅ █
  417 μs        Histogram: log(frequency) by time        816 μs <

 Memory estimate: 349.91 KiB, allocs estimate: 16.

julia> @benchmark sort!(x) setup=(x=OffsetArray(rand(20_000), 1:20_000))
BenchmarkTools.Trial: 4266 samples with 1 evaluation.
 Range (min  max):  991.905 μs   1.702 ms  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):       1.053 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.071 ms ± 65.684 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▁▃▄▆▇██▇▆▃▂▂▃▂▂▂▁▁▁                                       ▁
  ▅▇███████████████████████▇▇▇▇▆▆▇▆▆▆▆▇▇▇▆▅▆▆▆▆▅▆▄▄▄▄▅▅▃▃▃▃▃▃▃ █
  992 μs        Histogram: log(frequency) by time      1.38 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sort!(x) setup=(x=rand(20_000))
BenchmarkTools.Trial: 4143 samples with 1 evaluation.
 Range (min  max):  1.005 ms   2.248 ms  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     1.078 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.103 ms ± 85.456 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▁▄▅▆██▇▅▂▂▃▂▂▁▁▁          ▁▁▁                           ▁
  ▃▆▇███████████████████▇▇▆▇██████▇▆▇▆▇▆▅▆▃▆▅▅▃▅▅▃▄▃▅▅▅▄▅▃▃▆ █
  1.01 ms      Histogram: log(frequency) by time     1.49 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using an OffsetArray is clever. Though that will only make a significant difference if lo:hi covers a relatively small part of vs. AFAICT this will only happen when there are lots of NaNs (via fpsort!), right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is also the only case I'm aware of. Do you still think it's worth it? It adds a dependency on OffsetArrays.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe not if that's the only situation where it happens.

src/SortingAlgorithms.jl Outdated Show resolved Hide resolved
Comment on lines +71 to +92
function sort!(vs::AbstractVector, lo::Int, hi::Int, a::RadixSortAlg, o::Ordering)
if hi <= lo
vs
elseif hi - lo < RADIX_SMALL_THRESHOLD
sort!(vs, lo, hi, Base.Sort.defalg(vs), o)
else
_sort!(vs, lo, hi, a, o, similar(vs, lo:hi))
end
end
function sort!(vs::AbstractVector{T}, lo::Int, hi::Int, a::RadixSortAlg, o::Ordering, ts::AbstractVector{T}) where T
if hi <= lo
vs
elseif hi - lo < RADIX_SMALL_THRESHOLD
checkbounds(ts, lo:hi) # Consistently throw an error for insufficient ts.
sort!(vs, lo, hi, Base.Sort.defalg(vs), o)
else
_sort!(vs, lo, hi, a, o, ts)
end
end
function _sort!(vs::AbstractVector{T}, lo::Int, hi::Int, a::RadixSortAlg, o::Ordering, ts::AbstractVector{T}) where T
checkbounds(vs, lo:hi)
checkbounds(ts, lo:hi)
Copy link
Contributor

@nalimilan nalimilan Oct 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't you simplify all this by taking the following approach (not tested)?

Suggested change
function sort!(vs::AbstractVector, lo::Int, hi::Int, a::RadixSortAlg, o::Ordering)
if hi <= lo
vs
elseif hi - lo < RADIX_SMALL_THRESHOLD
sort!(vs, lo, hi, Base.Sort.defalg(vs), o)
else
_sort!(vs, lo, hi, a, o, similar(vs, lo:hi))
end
end
function sort!(vs::AbstractVector{T}, lo::Int, hi::Int, a::RadixSortAlg, o::Ordering, ts::AbstractVector{T}) where T
if hi <= lo
vs
elseif hi - lo < RADIX_SMALL_THRESHOLD
checkbounds(ts, lo:hi) # Consistently throw an error for insufficient ts.
sort!(vs, lo, hi, Base.Sort.defalg(vs), o)
else
_sort!(vs, lo, hi, a, o, ts)
end
end
function _sort!(vs::AbstractVector{T}, lo::Int, hi::Int, a::RadixSortAlg, o::Ordering, ts::AbstractVector{T}) where T
checkbounds(vs, lo:hi)
checkbounds(ts, lo:hi)
function sort!(vs::AbstractVector{T}, lo::Int, hi::Int, a::RadixSortAlg, o::Ordering,
ts::Union{AbstractVector{T}, Nothing}=nothing) where T
if hi <= lo
return vs
elseif hi - lo < RADIX_SMALL_THRESHOLD
return sort!(vs, lo, hi, Base.Sort.defalg(vs), o)
elseif ts === nothing
ts = similar(vs, lo:hi)
end
checkbounds(vs, lo:hi)
checkbounds(ts, lo:hi)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I didn't want to expose sort!(... ::Nothing) for potential compatibility issues, but if this is internal &/or we are okay with declaring sort!(... ::Nothing), then that would be simpler. Is it okay to expand the interface in this case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also put the type check before the fallback & add a ts boundscheck on the fallback if ts is provided to give consistent errors. (Not require users to test large arrays to ensure they are getting errors)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you think it's okay to expand the interface in this way, I'll make the changes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nalimilan, do you think it's okay to expand the interface in this way?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I think that's fine.

function sort!(vs::AbstractVector, lo::Int, hi::Int, ::RadixSortAlg, o::Ordering, ts=similar(vs))
# Input checking
if lo >= hi; return vs; end
function sort!(vs::AbstractVector, lo::Int, hi::Int, ::RadixSortAlg, o::Ordering, ts=similar(vs, 0))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using an OffsetArray is clever. Though that will only make a significant difference if lo:hi covers a relatively small part of vs. AFAICT this will only happen when there are lots of NaNs (via fpsort!), right?

LilithHafner and others added 4 commits October 23, 2021 09:52
Defer this to a separate PR
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@LilithHafner
Copy link
Member Author

With radix sort in Base, I'm not as interested in maintaining RadixSort here. If someone has a use case for radix sort that isn't covered by Base, ping me, but until then I'm going to close this in favor of JuliaLang/julia#44230.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Radix Sort Performance Issue (100x slower than Base)
4 participants