
~5x speedup than SparseArrays #28

Open · Roger-luo opened this issue Jan 14, 2023 · 3 comments

Roger-luo (Member) commented Jan 14, 2023

Conclusion: this package seems slightly slower than existing ones, but faster at some of the sizes; bmul is the fastest one. I'm looking forward to seeing how much @batch can improve on this.

Not sure why, but this was originally ~8x slower than SparseArrays; the following is the comparison at different sizes.

I just have a naive benchmark; it currently shows a ~5x speedup at 2^20 x 2^20:

julia> speedup = map(zip(reports, base_reports)) do (r, br)
           minimum(br).time / minimum(r).time
       end
9-element Vector{Float64}:
 0.00560784995998469
 0.013175431564728122
 0.03735109945863908
 0.20631260631260628
 0.4875896304467733
 2.7184508079539054
 4.600574856107668
 5.6593177540493
 5.013580931023638

Version info (8 threads on an M1 Pro):

julia> versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.5.0)
  CPU: 10 × Apple M1 Pro
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
  Threads: 8 on 8 virtual cores
The script to reproduce:

using BenchmarkTools
using ParallelMergeCSR
using SparseArrays

function benchmark(n::Int)
    A = sprand(2^n, 2^n, 1e-4)
    x = rand(2^n)
    y = similar(x)
    return @benchmark ParallelMergeCSR.mul!($y, transpose($A), $x, 1.0, 0.0)
end

function benchmark_base(n::Int)
    A = sprand(2^n, 2^n, 1e-4)
    x = rand(2^n)
    y = similar(x)
    return @benchmark SparseArrays.mul!($y, transpose($A), $x, 1.0, 0.0)
end

reports = []
for n in 4:2:20
    @info "benchmarking" n
    push!(reports, benchmark(n))
end

base_reports = []
for n in 4:2:20
    @info "benchmarking" n
    push!(base_reports, benchmark_base(n))
end

speedup = map(zip(reports, base_reports)) do (r, br)
    minimum(br).time / minimum(r).time
end
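For reference, here is a minimal sketch of what an @batch-parallel kernel could look like. This is an assumption on my part using Polyester's @batch, not ParallelMergeCSR's actual implementation, and batch_mul! is a hypothetical name: each element of y = transpose(A)*x reads exactly one column of the CSC matrix, so the outer loop parallelizes without atomics.

using LinearAlgebra, Polyester, SparseArrays

# Hypothetical kernel (not the package's implementation): computes
# y .= alpha .* transpose(A) * x .+ beta .* y with @batch over output
# elements. Column j of the CSC matrix A is row j of transpose(A),
# so every y[j] is independent of the others.
function batch_mul!(y::AbstractVector, At::Transpose{<:Any,<:SparseMatrixCSC},
                    x::AbstractVector, alpha::Number, beta::Number)
    A = parent(At)
    rv, nz = rowvals(A), nonzeros(A)
    @batch for j in 1:size(A, 2)
        acc = zero(eltype(y))
        for k in nzrange(A, j)
            acc += nz[k] * x[rv[k]]
        end
        # treat beta == 0 specially so an uninitialized y cannot leak NaNs
        y[j] = iszero(beta) ? alpha * acc : alpha * acc + beta * y[j]
    end
    return y
end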
Roger-luo added the "bug" label on Jan 14, 2023
Roger-luo changed the title from "~8x slower than SparseArrays" to "~5x speedup than SparseArrays" on Jan 14, 2023
Roger-luo removed the "bug" label on Jan 14, 2023
Roger-luo (Member, Author) commented:

A more detailed benchmark, comparing against ThreadedSparseCSR:

julia> speedup = map(reports["base"], reports["this"]) do br, r
           minimum(br).time / minimum(r).time
       end
9-element Vector{Float64}:
 0.0055334472724891184
 0.010322679308762808
 0.04274428084218395
 0.14642857142857144
 0.47042220219007225
 2.1794871794871793
 4.394308194754103
 5.724467113647089
 4.775544347485181

julia> speedup = map(reports["base"], reports["tmul"]) do br, r
           minimum(br).time / minimum(r).time
       end
9-element Vector{Float64}:
 0.006753095374756376
 0.03454683929931455
 0.09163389771566433
 0.5019588638589618
 0.9172189246816113
 3.053892215568862
 6.099349844322258
 5.354817480719794
 5.343557759590914

julia> speedup = map(reports["base"], reports["bmul"]) do br, r
           minimum(br).time / minimum(r).time
       end
9-element Vector{Float64}:
 0.014777543308632282
 0.03007957559681698
 0.11305128801015392
 0.4823529411764706
 1.7014805708482548
 7.927461139896373
 7.44011544011544
 5.3025629910750975
 5.277065594192646
with the following script:

using BenchmarkTools
using ParallelMergeCSR
using SparseArrays
using SparseMatricesCSR
using ThreadedSparseCSR
ThreadedSparseCSR.get_num_threads() # check how many threads ThreadedSparseCSR will use

function benchmark(n::Int)
    A = transpose(sprand(2^n, 2^n, 1e-4))
    x = rand(2^n)
    y = similar(x)
    return @benchmark ParallelMergeCSR.mul!($y, $A, $x, 1.0, 0.0)
end

function benchmark_base(n::Int)
    A = transpose(sprand(2^n, 2^n, 1e-4))
    x = rand(2^n)
    y = similar(x)
    return @benchmark SparseArrays.mul!($y, $A, $x, 1.0, 0.0)
end

function benchmark_tmul(n::Int)
    x = rand(2^n)
    y = similar(x)
    csrA = SparseMatrixCSR(transpose(sprand(2^n, 2^n, 1e-4)));
    return @benchmark tmul!($y, $csrA, $x, 1.0, 0.0)
end

function benchmark_bmul(n::Int)
    x = rand(2^n)
    y = similar(x)
    csrA = SparseMatrixCSR(transpose(sprand(2^n, 2^n, 1e-4)));
    return @benchmark bmul!($y, $csrA, $x, 1.0, 0.0)
end

reports = Dict(
    "this" => [],
    "base" => [],
    "tmul" => [],
    "bmul" => [],
)

for n in 4:2:20
    @info "benchmarking" n
    push!(reports["this"], benchmark(n))
    push!(reports["base"], benchmark_base(n))
    push!(reports["tmul"], benchmark_tmul(n))
    push!(reports["bmul"], benchmark_bmul(n))
end

speedup = map(reports["base"], reports["this"]) do br, r
    minimum(br).time / minimum(r).time
end

speedup = map(reports["base"], reports["tmul"]) do br, r
    minimum(br).time / minimum(r).time
end

speedup = map(reports["base"], reports["bmul"]) do br, r
    minimum(br).time / minimum(r).time
end

weinbe58 (Member) commented:

These results really don't make sense to me: why does the speedup seem to decrease with larger matrices?

weinbe58 (Member) commented:

Also, one issue I see is that the matrices are not the same between the benchmarks, so it's not exactly a fair comparison for smaller matrices, where the matrix-to-matrix fluctuations in the number of non-zero elements are larger.
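For what it's worth, a minimal sketch of a fairer setup (hypothetical, not one of the scripts above; benchmark_pair is an illustrative name): generate one matrix per size and time every method against that same matrix, so the ratio cannot be skewed by different random draws.

using BenchmarkTools, SparseArrays, ParallelMergeCSR

# Both methods see the exact same A, x, and y, so fluctuations in nnz
# between random draws cancel out of the speedup ratio.
function benchmark_pair(n::Int)
    A = sprand(2^n, 2^n, 1e-4)
    x = rand(2^n)
    y = similar(x)
    this = @benchmark ParallelMergeCSR.mul!($y, transpose($A), $x, 1.0, 0.0)
    base = @benchmark SparseArrays.mul!($y, transpose($A), $x, 1.0, 0.0)
    return minimum(base).time / minimum(this).time
end

speedup = [benchmark_pair(n) for n in 4:2:20]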
