Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add PatternDefeatingQuicksort #72

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

LSchwerdt
Copy link
Contributor

@LSchwerdt LSchwerdt commented Feb 11, 2023

Add Pattern-defeating Quicksort, an unstable, O(1) space, guaranteed* O(n log n) time variant of quicksort. Both normal, and branchless (block-) partitioning schemes are implemented.

(*) by falling back to HeapSort for pathological inputs

Highlights (BranchlessPdqSort):

  • 20% - 50% faster than current default when sorting Int128s and UInt128s (while using O(1) space instead of O(n))
  • Significantly faster than default for Int64s and UInt64s for special inputs:
    • Almost presorted
    • Almost reverse presorted
    • Many duplicates, but wide value range
  • Edit: Enables sortperm using the same memory footprint as current default sortperm, by simultaneously sorting values and indices. This achieves 5x-20x speedup vs sortperm for 2 GiB of Int64s, depending on hardware. It is always faster than sortperm, except for
    • very small problem sizes (that fit in L1$),
    • medium problem sizes (that fit in L3$) when sortperm uses special optimizations (Floats, Missing).

Checklist

Open Question

  • BranchlessPdqSort is the main star of this PR. The only advantage of BranchyPdqSort is, that it uses zero allocations.
    Maybe define const PdqSort = BranchlessPdqSort, or delete BranchyPdqSort entirely?

Notes

  • Uses randomization instead of deterministic pattern breaking after an unbalanced partition.
    The implementations in Go and Rust do aswell, and the orginal author endorses that approach.
  • There are potentially faster QuickSort variants ([1],[2]), but implementing them in pure Julia is hard. /impossible?

Benchmarks

The benchmarks can be recreated using this Pluto notebook.
A script for very quick benchmarking is below at the end.

Full benchmark results (9d36ce4):

AMD 3900X, 2ch DDR4-3600 CL16 (≙ 8.9ns)
Intel 7820X, 4ch DDR4-2666 CL19 (≙14.3ns)
Intel 6700HQ, 2ch DDR4-2400 CL14 (≙11.7ns)

Here are the results from the 7820X (b603963):

Sort Int64

pdqSort_bench_Int64
pdqSort_bench_Int64_speedup

Sort Int128

pdqSort_bench_Int128
pdqSort_bench_Int128_speedup

Sort special inputs

pdqSort_bench_specialinputs

SimultaneousSortperm

A package that implements sortperm using PdqSort is here.
The benchmarking code and more results are found here.

Results (Intel 7820x)

Int64
Int64_speedup
Int64_almost_presorted
Int64_almost_presorted_speedup
Float64
Float64_speedup
Float64_by_abs2
Float64_by_abs2_speedup
Categorical

Quick benchmarking script

using BenchmarkTools, Random, SortingAlgorithms
versioninfo()
for i in 2:5
    n = 17^i
    println("sort!(rand(Int128, $n))")
    v = rand(Int128, n)
    print("Default:           "); @btime sort!($v) setup=(rand!($v)) evals=1
    print("Quicksort:         "); @btime sort!($v; alg=QuickSort) setup=(rand!($v)) evals=1
    print("BranchyPdqSort:    "); @btime sort!($v; alg=BranchyPdqSort) setup=(rand!($v)) evals=1
    print("BranchlessPdqSort: "); @btime sort!($v; alg=BranchlessPdqSort) setup=(rand!($v)) evals=1
end

function setupv!(v)
    v[1:end-1] = 1:length(v)-1
    v[end] = 10
end

for i in 2:5
    n = 17^i
    println("sort!([1:$n;10])")
    v = [1:n;10]
    print("Default:           "); @btime sort!($v) setup=(setupv!($v)) evals=1
    print("Quicksort:         "); @btime sort!($v; alg=QuickSort) setup=(setupv!($v)) evals=1
    print("BranchyPdqSort:    "); @btime sort!($v; alg=BranchyPdqSort) setup=(setupv!($v)) evals=1
    print("BranchlessPdqSort: "); @btime sort!($v; alg=BranchlessPdqSort) setup=(setupv!($v)) evals=1
end

for i in 2:5
    n = 17^i
    println("sort!(rand(Complex{Float64},$n), by=abs2)")
    v = rand(Complex{Float64},n)
    print("Default:           "); @btime sort!($v, by=abs2) setup=(rand!($v)) evals=1
    print("Quicksort:         "); @btime sort!($v, by=abs2; alg=QuickSort) setup=(rand!($v)) evals=1
    print("BranchyPdqSort:    "); @btime sort!($v, by=abs2; alg=BranchyPdqSort) setup=(rand!($v)) evals=1
    print("BranchlessPdqSort: "); @btime sort!($v, by=abs2; alg=BranchlessPdqSort) setup=(rand!($v)) evals=1
end
Result on the 3900x (b603963)
Julia Version 1.9.0-beta4
Commit b75ddb787f (2023-02-07 21:53 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 24 × AMD Ryzen 9 3900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver2)
  Threads: 12 on 24 virtual cores      
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 12
sort!(rand(Int128, 289))
Default:             5.400 μs (1 allocation: 4.69 KiB)
Quicksort:           6.200 μs (0 allocations: 0 bytes)
BranchyPdqSort:      6.200 μs (0 allocations: 0 bytes)
BranchlessPdqSort:   4.700 μs (2 allocations: 1.06 KiB)
sort!(rand(Int128, 4913))
Default:             119.900 μs (2 allocations: 76.86 KiB)
Quicksort:           165.500 μs (0 allocations: 0 bytes)
BranchyPdqSort:      171.200 μs (0 allocations: 0 bytes)
BranchlessPdqSort:   103.500 μs (2 allocations: 1.06 KiB)
sort!(rand(Int128, 83521))
Default:             2.711 ms (2 allocations: 1.27 MiB)
Quicksort:           4.190 ms (0 allocations: 0 bytes)
BranchyPdqSort:      4.260 ms (0 allocations: 0 bytes)
BranchlessPdqSort:   2.294 ms (2 allocations: 1.06 KiB)
sort!(rand(Int128, 1419857))
Default:             60.424 ms (2 allocations: 21.67 MiB)
Quicksort:           89.065 ms (0 allocations: 0 bytes)
BranchyPdqSort:      91.504 ms (0 allocations: 0 bytes)
BranchlessPdqSort:   46.587 ms (2 allocations: 1.06 KiB)
sort!([1:289;10])
Default:             1.200 μs (2 allocations: 2.77 KiB)
Quicksort:           1.200 μs (0 allocations: 0 bytes)
BranchyPdqSort:      1.000 μs (0 allocations: 0 bytes)
BranchlessPdqSort:   1.200 μs (2 allocations: 1.06 KiB)
sort!([1:4913;10])
Default:             19.900 μs (3 allocations: 39.55 KiB)
Quicksort:           29.800 μs (0 allocations: 0 bytes)
BranchyPdqSort:      14.900 μs (0 allocations: 0 bytes)
BranchlessPdqSort:   15.000 μs (2 allocations: 1.06 KiB)
sort!([1:83521;10])
Default:             448.400 μs (3 allocations: 656.80 KiB)
Quicksort:           651.300 μs (0 allocations: 0 bytes)
BranchyPdqSort:      257.300 μs (0 allocations: 0 bytes)
BranchlessPdqSort:   257.700 μs (2 allocations: 1.06 KiB)
sort!([1:1419857;10])
Default:             13.690 ms (3 allocations: 10.83 MiB)
Quicksort:           13.954 ms (0 allocations: 0 bytes)
BranchyPdqSort:      4.400 ms (0 allocations: 0 bytes)
BranchlessPdqSort:   4.178 ms (2 allocations: 1.06 KiB)
sort!(rand(Complex{Float64},289), by=abs2)
Default:             8.200 μs (1 allocation: 4.69 KiB)
Quicksort:           11.600 μs (0 allocations: 0 bytes)
BranchyPdqSort:      11.800 μs (0 allocations: 0 bytes)
BranchlessPdqSort:   7.900 μs (2 allocations: 1.06 KiB)
sort!(rand(Complex{Float64},4913), by=abs2)
Default:             211.000 μs (2 allocations: 76.86 KiB)
Quicksort:           314.600 μs (0 allocations: 0 bytes)
BranchyPdqSort:      323.200 μs (0 allocations: 0 bytes)
BranchlessPdqSort:   179.800 μs (2 allocations: 1.06 KiB)
sort!(rand(Complex{Float64},83521), by=abs2)
Default:             5.041 ms (2 allocations: 1.27 MiB)
Quicksort:           7.692 ms (0 allocations: 0 bytes)
BranchyPdqSort:      7.922 ms (0 allocations: 0 bytes)
BranchlessPdqSort:   3.739 ms (2 allocations: 1.06 KiB)
sort!(rand(Complex{Float64},1419857), by=abs2)
Default:             109.978 ms (2 allocations: 21.67 MiB)
Quicksort:           166.103 ms (0 allocations: 0 bytes)
BranchyPdqSort:      172.113 ms (0 allocations: 0 bytes)
BranchlessPdqSort:   75.914 ms (2 allocations: 1.06 KiB)

Additional correctness checks

No stable sorting guarantee -> simpler checks.

test script
using Base.Order
using Test
using SortingAlgorithms
for T in (Float64, Int, UInt8, UInt128)
    println(T)
    for alg in [BranchlessPdqSort, BranchyPdqSort]
        for order in [Forward, Reverse, By(identity)]
            for n in [0:1000; 2 .^(10:20); 2 .^(10:20).-1]
                v = rand(T, n)
                @test sort(v; order) == sort(v; alg, order)
            end
            for n in [0:1000; 2 .^(10:20); 2 .^(10:20).-1]
                v = rand(T, n) .% 4
                @test sort(v; order) == sort(v; alg, order)
            end
            for n in [0:1000; 2 .^(10:20); 2 .^(10:20).-1]
                v = [9;sort(rand(T, n));8]
                @test sort(v; order) == sort(v; alg, order)
            end
        end
    end
end

@codecov-commenter
Copy link

codecov-commenter commented Feb 11, 2023

Codecov Report

Merging #72 (8b44ef3) into master (2659334) will increase coverage by 0.96%.
The diff coverage is 98.37%.

@@            Coverage Diff             @@
##           master      #72      +/-   ##
==========================================
+ Coverage   96.02%   96.98%   +0.96%     
==========================================
  Files           1        1              
  Lines         352      598     +246     
==========================================
+ Hits          338      580     +242     
- Misses         14       18       +4     
Impacted Files Coverage Δ
src/SortingAlgorithms.jl 96.98% <98.37%> (+0.96%) ⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

 - remove selectpivot! function
 - use m instead of mid
 - use lo, m, hi in function signatures
 - midpoint instead of _midpoint
Test results:
- 0.8.0: fail
- 0.8.1: pass
- 0.9.0: pass
- 0.9.2: pass
- 0.10.0: pass
- 0.11.0: pass
- 0.12.0: pass
- 1.0.0: pass
- 1.5.15: pass
@LSchwerdt LSchwerdt marked this pull request as ready for review February 16, 2023 17:36
... in case of a highly unbalanced partition.

The randomization is inspired by ScratchQuickSort.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants