
A*A' significantly slower than A*A for ArbMatrices #119

Open
nanleij opened this issue Apr 12, 2021 · 3 comments

Comments

@nanleij
Contributor

nanleij commented Apr 12, 2021

Hi,

I was trying to see if I could use this package instead of BigFloats for some of my computations. Among other things (such as Cholesky decompositions and computing the minimum eigenvalue), I need to compute A*A' for matrices of size up to about 300x300, which is currently the bottleneck.

I came across the following remarkable difference:

julia> A = ArbMatrix(rand(BigFloat,100,100));

julia> @benchmark $A*$A
BenchmarkTools.Trial:
  memory estimate:  48 bytes
  allocs estimate:  1
  --------------
  minimum time:     128.750 ms (0.00% GC)
  median time:      132.684 ms (0.00% GC)
  mean time:        133.845 ms (0.00% GC)
  maximum time:     148.832 ms (0.00% GC)
  --------------
  samples:          38
  evals/sample:     1

julia> @benchmark $A*$A'
BenchmarkTools.Trial:
  memory estimate:  249.02 MiB
  allocs estimate:  4080001
  --------------
  minimum time:     3.125 s (22.31% GC)
  median time:      3.331 s (26.56% GC)
  mean time:        3.331 s (26.56% GC)
  maximum time:     3.537 s (30.31% GC)
  --------------
  samples:          2
  evals/sample:     1

So multiplying by a transpose is significantly slower than a plain multiplication, possibly because of all the allocations that happen in the transposed case.
Comparing it to BigFloats:

julia> B = rand(BigFloat,100,100);

julia> @benchmark $B*$B
BenchmarkTools.Trial:
  memory estimate:  218.35 MiB
  allocs estimate:  4100002
  --------------
  minimum time:     911.017 ms (11.87% GC)
  median time:      930.252 ms (11.95% GC)
  mean time:        933.218 ms (11.67% GC)
  maximum time:     962.344 ms (12.72% GC)
  --------------
  samples:          6
  evals/sample:     1

julia> @benchmark $B*$B'
BenchmarkTools.Trial:
  memory estimate:  218.35 MiB
  allocs estimate:  4100002
  --------------
  minimum time:     1.228 s (8.92% GC)
  median time:      1.237 s (9.04% GC)
  mean time:        1.238 s (8.64% GC)
  maximum time:     1.255 s (9.08% GC)
  --------------
  samples:          5
  evals/sample:     1

So for BigFloats, multiplying by the transpose is also a bit slower, but only by about 30%, compared to about 2500% for ArbMatrix. Another option is to use plain Julia matrices with Arb entries:

julia> C = rand(Arb,100,100);

julia> @benchmark $C*$C
BenchmarkTools.Trial:
  memory estimate:  124.66 MiB
  allocs estimate:  2040002
  --------------
  minimum time:     1.572 s (14.81% GC)
  median time:      1.807 s (20.96% GC)
  mean time:        1.872 s (25.84% GC)
  maximum time:     2.237 s (37.54% GC)
  --------------
  samples:          3
  evals/sample:     1

julia> @benchmark $C*$C'
BenchmarkTools.Trial:
  memory estimate:  124.66 MiB
  allocs estimate:  2040002
  --------------
  minimum time:     2.014 s (11.50% GC)
  median time:      2.029 s (19.55% GC)
  mean time:        2.166 s (22.57% GC)
  maximum time:     2.456 s (34.14% GC)
  --------------
  samples:          3
  evals/sample:     1

So this is more comparable to the BigFloats: within about a factor of 2 in both cases. (Interestingly, it uses only about half the memory/allocations.)

To me it seems like the ArbMatrices fall back to the generic matrix multiplication in the case with the transpose. Is there an easy way around this?

PS: I also tried Nemo (which has similar speeds for A*A and A*A'), but I also need the minimum eigenvalue and the Cholesky decomposition, which are currently not available in Nemo. So I would need to convert back and forth to BigFloats, or compute those myself; with Nemo I'm not sure how to do this, but with Arblib the conversion is easy. Hence I would very much like A*A' to be as fast as A*A.

@saschatimme
Collaborator

To me it seems like the ArbMatrices fall back to the generic matrix multiplication in the case with the transpose. Is there an easy way around this?

Arb only implements the matrix multiplication A*B, with no specialized versions where A or B is transposed. The easiest fix is probably to realize A' as an ArbMatrix and then perform the multiplication. This is also how I would probably implement it here in Arblib. A pull request for this is very much welcome 🙂
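A minimal sketch of that workaround (assuming Arblib and the benchmark setup from the original post; the key step is materializing the lazy adjoint as a concrete ArbMatrix before multiplying):

```julia
using Arblib, LinearAlgebra

A = ArbMatrix(rand(BigFloat, 100, 100))

# Slow: A' is a lazy Adjoint wrapper, so A * A' dispatches to
# LinearAlgebra's generic matmul, allocating per scalar operation.
# A * A'

# Fast: realize the adjoint as a concrete ArbMatrix first, so the
# multiplication hits Arb's native matrix-multiplication routine.
B = A * ArbMatrix(A')

# In-place variant that reuses preallocated output matrices:
C = similar(A)
D = similar(A)
Arblib.mul!(D, A, Arblib.transpose!(C, A))
```

The extra ArbMatrix conversion allocates one temporary matrix, which is negligible next to the cost of the generic fallback.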

@kalmarek
Owner

kalmarek commented Apr 12, 2021

julia> @btime A*A;
  11.988 ms (10089 allocations: 4.11 MiB)

julia> @btime A*A';
  786.593 ms (8160005 allocations: 436.25 MiB)

julia> C = similar(A); D = similar(A); @btime Arblib.mul!($D, $A, Arblib.transpose!($C, $A));
  12.785 ms (85 allocations: 3.20 MiB)

we're hitting the generic matmul here:

julia> @which A*A
*(A::T, B::T) where T<:Union{AcbMatrix, AcbRefMatrix, ArbMatrix, ArbRefMatrix} in Arblib at /home/kalmar/.julia/dev/Arblib/src/matrix.jl:166

julia> @which A*A'
*(A::AbstractMatrix{T} where T, B::AbstractMatrix{T} where T) in LinearAlgebra at /home/kalmar/packages/julias/julia-1.6/share/julia/stdlib/v1.6/LinearAlgebra/src/matmul.jl:151
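The missing specializations could be added along these lines. This is a hypothetical sketch, not Arblib's actual code: it simply applies the materialize-then-multiply strategy from the comment above to the Adjoint/Transpose wrapper types that currently fall through to the generic method.

```julia
using Arblib, LinearAlgebra

# Hypothetical method definitions: convert the lazy Adjoint/Transpose
# wrapper into a concrete ArbMatrix so Arb's fast matmul is dispatched to.
Base.:*(A::ArbMatrix, B::Adjoint{<:Any,<:ArbMatrix})   = A * ArbMatrix(B)
Base.:*(A::Adjoint{<:Any,<:ArbMatrix}, B::ArbMatrix)   = ArbMatrix(A) * B
Base.:*(A::ArbMatrix, B::Transpose{<:Any,<:ArbMatrix}) = A * ArbMatrix(B)
Base.:*(A::Transpose{<:Any,<:ArbMatrix}, B::ArbMatrix) = ArbMatrix(A) * B
```

With these in place, @which A*A' would resolve to one of the methods above instead of LinearAlgebra's generic matmul. (For real ArbMatrix the adjoint and transpose coincide; AcbMatrix would additionally need conjugation.)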

@nanleij
Contributor Author

nanleij commented Apr 12, 2021

A*ArbMatrix(A') works like a charm, thanks!
I might look into a pull request later.
