
Reduce load time by shifting mul! definition #1904

Merged (9 commits, merged Jun 2, 2023)

Conversation

dkarrasch
Contributor

This is an attempt to reduce load time. It does so not by hooking into mul! directly, but by letting LinearAlgebra.jl take the first step and forward to generic_matmatmul!, which is where we hook in. The logic is very similar to that in gemm_dispatch!, which I manually "inlined" (i.e., copy-pasted) here. I don't know whether external packages rely on gemm_dispatch!, or whether it could be removed.

@KristofferC pointed me to these methods. The same could be done for matvec multiplication, if that turns out to be a significant contributor to load time.
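For readers unfamiliar with the dispatch layering, a minimal sketch of the idea follows. The generic_matmatmul! signature, the element-type restriction, and the direct gemm! call are illustrative assumptions (they vary with the LinearAlgebra version); this is not the exact code in this PR:

using LinearAlgebra, CUDA

# Hook in one level below mul!: LinearAlgebra's own mul! methods forward to
# generic_matmatmul!, so a single method here can replace many mul! overloads.
function LinearAlgebra.generic_matmatmul!(C::CuMatrix{T}, tA::AbstractChar, tB::AbstractChar,
                                          A::CuMatrix{T}, B::CuMatrix{T},
                                          _add::LinearAlgebra.MulAddMul) where {T<:Union{Float32,Float64,ComplexF32,ComplexF64}}
    # tA/tB encode any Transpose/Adjoint wrapper that LinearAlgebra has already peeled off,
    # so we can pick the CUBLAS routine directly, much like gemm_dispatch! used to do
    CUBLAS.gemm!(tA, tB, _add.alpha, A, B, _add.beta, C)
    return C
end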

@dkarrasch
Contributor Author

While I'm fixing my own typos, let me ask @amontoison: in #1632 you introduced mul! methods for doubly-wrapped matrices. You then unwrap both wrappers, but only keep track of whether one of them was a Transpose or Adjoint. Unless I'm missing something, the called method doesn't seem to know that the original matrix was a HermOrSym? It also seems that symmetric matrices are not tested?

@amontoison
Member

amontoison commented May 12, 2023

Hi @dkarrasch, NVIDIA didn't develop routines for symmetric sparse matrices, and the CUDA documentation explains that we must always store both triangles.

The HermOrSym wrapper is not relevant here, but we still need to define new mul! methods to perform matrix-vector / matrix-matrix products when the sparse matrices are wrapped in these lazy wrappers.

@amontoison
Member

amontoison commented May 12, 2023

@dkarrasch I'm wondering why, in Julia, T \ v dispatches to the two-argument ldiv! when T is triangular. With CUDA 12.0, the in-place backward and forward sweeps were removed, and I added a collection of \ methods to use the three-argument ldiv!.
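For illustration, such a method could look roughly as follows; this is a hedged sketch under assumed types (the actual CUDA.jl methods cover more wrappers and storage formats):

using LinearAlgebra, CUDA, CUDA.CUSPARSE

# Route `\` to the out-of-place, three-argument ldiv!, since CUDA 12.0 removed the
# in-place sparse triangular sweeps. Types and coverage here are illustrative only.
function Base.:\(A::UpperTriangular{T,<:CuSparseMatrixCSR{T}}, b::CuVector{T}) where {T}
    x = similar(b)         # allocate the solution vector up front
    return ldiv!(x, A, b)  # three-argument (out-of-place) triangular solve
end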


@eval begin
    function LinearAlgebra.mul!(C::CuVector{T}, A::$TypeA, B::CuSparseVector{T}, alpha::Number, beta::Number) where {T <: BlasFloat}
        gemvi!($transa(T), alpha, $(untaga(unwrapa(:A))), B, beta, C, 'O')
@dkarrasch
Contributor Author

@amontoison What does untaga(unwrapa(:A)) do when A is Adjoint{Float32, Hermitian{...}}?

@amontoison
Member

amontoison commented May 12, 2023

It should return parent(parent(A)) and remove all the lazy wrappers.

@dkarrasch
Contributor Author

OK, but then I don't understand how you remember that the A factor was Hermitian. Then transa(T) just stores the adjoint.

@amontoison
Member

amontoison commented May 12, 2023

We don't need to remember that A is Hermitian because we don't have CUDA routines specialized for them.
We just want to have a method when A is wrapped in Symmetric / Hermitian wrappers.
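In other words, the wrapper can simply be peeled off before calling the CUSPARSE routine, because the full sparse matrix (both triangles) is stored anyway. A hedged illustration with a hypothetical helper, not the package code:

using LinearAlgebra, CUDA, CUDA.CUSPARSE

# Hypothetical helper: a Symmetric/Hermitian wrapper around a CUSPARSE matrix carries no
# extra information for CUSPARSE (both triangles are stored), so it is simply unwrapped.
unwrap_symherm(A::LinearAlgebra.HermOrSym{T,<:CuSparseMatrixCSR{T}}) where {T} = parent(A)
unwrap_symherm(A::CuSparseMatrixCSR) = A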

@maleadt
Member

maleadt commented May 12, 2023

Interesting! I'm not familiar with LinearAlgebra dispatch, so really appreciate you taking a look at this.

At some point in the past, CUBLAS.jl used to implement LinearAlgebra.BLAS calls directly, which we got rid of because it assumed that CUBLAS behaves similarly to CPU BLAS. That's true most of the time, but not always, so implementing higher-level methods seemed like the better thing to do. I guess this takes (half) a step back, but that's probably worth it in the name of latency.

@maleadt
Member

maleadt commented May 13, 2023

Did a quick profile, and this does seem to cut the load time of the CUDA.jl pkgimage in half (1.36s -> ~600ms); nice! Loading of all of CUDA.jl improves from 2.5s to 1.8s. Here are the profiles:

There's still a bunch of mul! methods from CUSPARSE, as well as a similar explosion of ldiv! methods; could those benefit from a similar optimization?

@dkarrasch
Contributor Author

I tried to change the CUSPARSE code, but got confused by the "double wrappers" and tests were failing. I'll try that again with a fresh mind, perhaps in a new PR. If ldiv! is about triangular matrices, there should be some options as well. Basically, you let LinearAlgebra handle a matrix rhs, which turns the matrix solve into many column solves. That part is fairly generic, so you don't need to overload it. You end up overloading ldiv! only for a vector rhs, which is a different, smaller dispatch "niche" and should hopefully speed up method insertion.
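For context, the generic fallback that LinearAlgebra provides for a matrix right-hand side works essentially like the following (a hypothetical helper, shown only to illustrate why overloading the vector case suffices):

using LinearAlgebra

# Hypothetical illustration: a matrix right-hand side is solved column by column,
# so a package only needs to provide the vector-RHS ldiv! method.
function columnwise_ldiv!(X::AbstractMatrix, A, B::AbstractMatrix)
    for j in axes(B, 2)
        ldiv!(view(X, :, j), A, view(B, :, j))  # hits the vector-RHS method
    end
    return X
end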

@dkarrasch
Contributor Author

Hm, I have no idea how tests like this can fail due to this PR:

@testset "hemm!" begin
    # compute
    C = alpha*(hA*B) + beta*C
    CUBLAS.hemm!('L','L',alpha,dhA,d_B,beta,d_C)
    # move to host and compare
    h_C = Array(d_C)
    @test C ≈ h_C
end

The line C = alpha*(hA*B) + beta*C involves plain Matrix objects and is handled by LinearAlgebra, obviously. The line CUBLAS.hemm!('L','L',alpha,dhA,d_B,beta,d_C) is a direct call and cannot be diverted by dispatch. How can any of these calls be affected by the mul! dispatch that, in this package, is restricted to CuArrays?

@maleadt
Member

maleadt commented May 31, 2023

How can any of these calls be affected by the mul! dispatch that, in this package, is restricted to CuArrays?

The problem is with the inputs, i.e., hA ≈ Array(dhA) fails on your PR. The CUBLAS tests are pretty messy, in that some global arrays are reused throughout the tests. I'll try narrowing down where the change is introduced.

EDIT: ah, the problem is on line 660: that mul! used to throw an ArgumentError, and thus wasn't being executed. On this PR, the ArgumentError is gone, resulting in a mutation of dhA, which makes later tests fail.

@dkarrasch
Contributor Author

Ah, okay, cool, thanks for spotting this. This should get fixed by the GPUArrays.jl PR.

@dkarrasch
Contributor Author

Let's rerun CI and see how it goes.

@dkarrasch closed this on Jun 1, 2023
@dkarrasch reopened this on Jun 1, 2023
@maleadt
Member

maleadt commented Jun 1, 2023

First needs a bump of the Manifest. I'll push that shortly.

@maleadt
Member

maleadt commented Jun 1, 2023

Interesting failure. Looks like a bug in GPUArrays' generic matmatmul, where NaN inputs get preserved even if beta is zero. I'm not sure why this only surfaces on this PR though; we shouldn't have dispatched to CUBLAS for the ComplexF16 previously either.
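The convention at issue, in miniature: with beta == 0 the old contents of C must be ignored rather than multiplied, because 0 * NaN is NaN. An illustrative scalar update rule (not the GPUArrays kernel itself):

# Illustrative per-element update; a kernel applying this avoids propagating NaNs from C.
function update_entry(cij, acc, alpha, beta)
    if iszero(beta)
        return alpha * acc               # beta == 0: drop the old C entry entirely
    else
        return alpha * acc + beta * cij  # otherwise accumulate into the old value
    end
end

update_entry(NaN, 2.0, 1.0, 0.0)  # gives 2.0; a naive alpha*acc + beta*cij would give NaN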

@maleadt
Member

maleadt commented Jun 1, 2023

The remaining failure is because e.g. mul!(y::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, A::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, x::CuSparseVector{Float32, Int32}, α::Bool, β::Bool) currently dispatches to mul! overloads in SparseArrays.jl. I guess we won't be able to do this part until SparseArrays is updated?

@dkarrasch
Contributor Author

dkarrasch commented Jun 1, 2023

I merged the SparseArrays.jl PR. That affects only master/nightly/v1.10 though, so we may need to keep some mul! methods for older versions?

Thanks for fixing the typos!

@maleadt
Member

maleadt commented Jun 1, 2023

That affects only master/nightly/v1.10 though, so we may need to keep some mul! methods for older versions?

Yes, definitely. Luckily 1.10 is slated to be the new LTS so we should be able to get rid of that code in half a year or so, but for now we need to keep the functionality.
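A hedged sketch of how that compatibility gate could look; the version bound and the commented-out method are illustrative, not the actual code:

# Keep the explicit mul! overloads on Julia versions whose SparseArrays lacks the updated
# dispatch, and rely on the leaner generic hooks on newer releases.
@static if VERSION < v"1.10.0-"
    # legacy path: define the old mul! methods here, e.g.
    # LinearAlgebra.mul!(y::CuVector, A::CuMatrix, x::CuSparseVector, alpha::Number, beta::Number) = ...
else
    # newer SparseArrays forwards to the generic hooks, so nothing extra is needed here
end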

@dkarrasch
Contributor Author

I suggest removing the "double wrappings" and simplifying the method signatures to only one wrapper. Should I prepare and push, or are you working on it right now?

@maleadt
Member

maleadt commented Jun 1, 2023

I suggest removing the "double wrappings" and simplifying the method signatures to only one wrapper.

If you have an idea how to simplify, then please! I'm working on something else right now, so feel free to push here, but I'd understand if you're fed up by now 🙂

@dkarrasch
Contributor Author

but I'd understand if you're fed up by now 🙂

Not yet 😜

@maleadt
Member

maleadt commented Jun 2, 2023

Great!

@maleadt merged commit 47ec76b into JuliaGPU:master on Jun 2, 2023
@KristofferC
Contributor

Epic!
