
Reduce load time by shifting mul! definition #1904

Merged (9 commits, merged Jun 2, 2023)

Conversation

dkarrasch
Contributor

This is an attempt to reduce load time. It does so not by hooking into mul! directly, but by letting LinearAlgebra.jl take the first step and forward to generic_matmatmul!, which is where we hook in. The logic is very similar to that in gemm_dispatch!, which I manually "inlined" (i.e., copy-pasted) here. I don't know whether external packages rely on gemm_dispatch!, or whether it could be removed.

@KristofferC pointed me to these methods. The same could be done for matvec multiplication, if that turns out to be a significant contributor to load time.
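For readers unfamiliar with the dispatch layering, a minimal sketch of the idea follows. The generic_matmatmul! signature, the element-type restriction, and the direct gemm! call are illustrative assumptions (they vary with the LinearAlgebra version); this is not the exact code in this PR:

using LinearAlgebra, CUDA

# Hook in one level below mul!: LinearAlgebra's own mul! methods forward to
# generic_matmatmul!, so a single method here can replace many mul! overloads.
function LinearAlgebra.generic_matmatmul!(C::CuMatrix{T}, tA::AbstractChar, tB::AbstractChar,
                                          A::CuMatrix{T}, B::CuMatrix{T},
                                          _add::LinearAlgebra.MulAddMul) where {T<:Union{Float32,Float64,ComplexF32,ComplexF64}}
    # tA/tB encode any Transpose/Adjoint wrapper that LinearAlgebra has already peeled off,
    # so we can pick the CUBLAS routine directly, much like gemm_dispatch! used to do
    CUBLAS.gemm!(tA, tB, _add.alpha, A, B, _add.beta, C)
    return C
end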

@dkarrasch
Contributor Author

While I'm fixing my own typos, let me ask @amontoison: in #1632 you introduced mul! methods for doubly-wrapped matrices. You then unwrap both wrappers, but only keep track of whether one of them was a Transpose or Adjoint. Unless I'm missing something, the called method doesn't seem to know that the original matrix was a HermOrSym? It also seems that symmetric matrices are not tested?

@amontoison
Member

amontoison commented May 12, 2023

Hi @dkarrasch, NVIDIA didn't develop routines for symmetric sparse matrices, and the CUDA documentation explains that we must always store both triangles.

The HermOrSym wrapper is not relevant here, but we still need to define new mul! methods to perform matrix-vector / matrix-matrix products when the sparse matrices are wrapped in these lazy wrappers.

@amontoison
Member

amontoison commented May 12, 2023

@dkarrasch I'm wondering why, in Julia, T \ v dispatches to the two-argument ldiv! when T is triangular. With CUDA 12.0, the in-place backward and forward sweeps were removed, and I added a collection of \ methods to use the three-argument ldiv!.
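For illustration, such a method could look roughly as follows; this is a hedged sketch under assumed types (the actual CUDA.jl methods cover more wrappers and storage formats):

using LinearAlgebra, CUDA, CUDA.CUSPARSE

# Route `\` to the out-of-place, three-argument ldiv!, since CUDA 12.0 removed the
# in-place sparse triangular sweeps. Types and coverage here are illustrative only.
function Base.:\(A::UpperTriangular{T,<:CuSparseMatrixCSR{T}}, b::CuVector{T}) where {T}
    x = similar(b)         # allocate the solution vector up front
    return ldiv!(x, A, b)  # three-argument (out-of-place) triangular solve
end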


@eval begin
    function LinearAlgebra.mul!(C::CuVector{T}, A::$TypeA, B::CuSparseVector{T}, alpha::Number, beta::Number) where {T <: BlasFloat}
        gemvi!($transa(T), alpha, $(untaga(unwrapa(:A))), B, beta, C, 'O')
@dkarrasch
Contributor Author

@amontoison What does untaga(unwrapa(:A)) do when A is Adjoint{Float32, Hermitian{...}}?

@amontoison
Member

amontoison commented May 12, 2023

It should return parent(parent(A)) and remove all the lazy wrappers.

@dkarrasch
Contributor Author

OK, but then I don't understand how you remember that the A factor was Hermitian. Then transa(T) just stores the adjoint.

@amontoison
Member

amontoison commented May 12, 2023

We don't need to remember that A is Hermitian because we don't have CUDA routines specialized for them.
We just want to have a method when A is wrapped in Symmetric / Hermitian wrappers.
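In other words, the wrapper can simply be peeled off before calling the CUSPARSE routine, because the full sparse matrix (both triangles) is stored anyway. A hedged illustration with a hypothetical helper, not the package code:

using LinearAlgebra, CUDA, CUDA.CUSPARSE

# Hypothetical helper: a Symmetric/Hermitian wrapper around a CUSPARSE matrix carries no
# extra information for CUSPARSE (both triangles are stored), so it is simply unwrapped.
unwrap_symherm(A::LinearAlgebra.HermOrSym{T,<:CuSparseMatrixCSR{T}}) where {T} = parent(A)
unwrap_symherm(A::CuSparseMatrixCSR) = A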

@maleadt
Member

maleadt commented May 12, 2023

Interesting! I'm not familiar with LinearAlgebra dispatch, so really appreciate you taking a look at this.

At some point in the past, CUBLAS.jl used to implement LinearAlgebra.BLAS calls directly, which we got rid of because it assumed that CUBLAS behaves similarly to CPU BLAS. That's true most of the time, but not always, so implementing higher-level methods seemed like the better thing to do. I guess this takes (half) a step back, but that's probably worth it in the name of latency.

@maleadt
Member

maleadt commented May 13, 2023

Did a quick profile, and this does seem to cut the load time of the CUDA.jl pkgimage in half (1.36s -> ~600ms); nice! Loading of all of CUDA.jl improves from 2.5s to 1.8s. Here are the profiles:

There's still a bunch of mul! methods from CUSPARSE, as well as a similar explosion of ldiv! methods; could those benefit from a similar optimization?

@dkarrasch
Contributor Author

I tried to change the CUSPARSE code, but got confused by the "double wrappers" and tests were failing. I'll try that again with a fresh mind, perhaps in a new PR. If ldiv! is about triangular matrices, there should be some options as well. Basically, you let LinearAlgebra handle a matrix rhs, which turns the matrix solve into many column solves. That part is fairly generic, so you don't need to overload it. You end up overloading ldiv! only for a vector rhs, which is a different, smaller dispatch "niche" and should hopefully speed up method insertion.
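For context, the generic fallback that LinearAlgebra provides for a matrix right-hand side works essentially like the following (a hypothetical helper, shown only to illustrate why overloading the vector case suffices):

using LinearAlgebra

# Hypothetical illustration: a matrix right-hand side is solved column by column,
# so a package only needs to provide the vector-RHS ldiv! method.
function columnwise_ldiv!(X::AbstractMatrix, A, B::AbstractMatrix)
    for j in axes(B, 2)
        ldiv!(view(X, :, j), A, view(B, :, j))  # hits the vector-RHS method
    end
    return X
end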

@dkarrasch
Contributor Author

Hm, I have no idea how tests like this can fail due to this PR:

@testset "hemm!" begin
    # compute
    C = alpha*(hA*B) + beta*C
    CUBLAS.hemm!('L','L',alpha,dhA,d_B,beta,d_C)
    # move to host and compare
    h_C = Array(d_C)
    @test C ≈ h_C
end

The line C = alpha*(hA*B) + beta*C involves plain Matrix objects and is handled by LinearAlgebra, obviously. The line CUBLAS.hemm!('L','L',alpha,dhA,d_B,beta,d_C) is a direct call and cannot be diverted by dispatch. How can any of these calls be affected by the mul! dispatch that, in this package, is restricted to CuArrays?

@maleadt
Member

maleadt commented May 31, 2023

How can any of these calls be affected by the mul! dispatch that, in this package, is restricted to CuArrays?

The problem is with the inputs, i.e., hA ≈ Array(dhA) fails on your PR. The CUBLAS tests are pretty messy, in that some global arrays are reused throughout the tests. I'll try narrowing down where the change is introduced.

EDIT: ah, the problem is on line 660: that mul! used to throw an ArgumentError, and thus wasn't being executed. On this PR, the ArgumentError is gone, resulting in a mutation of dhA, which makes later tests fail.

@dkarrasch
Contributor Author

Ah, okay, cool, thanks for spotting this. This should get fixed by the GPUArrays.jl PR.

@dkarrasch
Contributor Author

Let's rerun CI and see how it goes.

@dkarrasch closed this on Jun 1, 2023
@dkarrasch reopened this on Jun 1, 2023
@maleadt
Member

maleadt commented Jun 1, 2023

First needs a bump of the Manifest. I'll push that shortly.

@maleadt
Member

maleadt commented Jun 1, 2023

Interesting failure. Looks like a bug in GPUArrays' generic matmatmul, where NaN inputs get preserved even if beta is zero. I'm not sure why this only surfaces on this PR though; we shouldn't have dispatched to CUBLAS for the ComplexF16 previously either.
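The convention at issue, in miniature: with beta == 0 the old contents of C must be ignored rather than multiplied, because 0 * NaN is NaN. An illustrative scalar update rule (not the GPUArrays kernel itself):

# Illustrative per-element update; a kernel applying this avoids propagating NaNs from C.
function update_entry(cij, acc, alpha, beta)
    if iszero(beta)
        return alpha * acc               # beta == 0: drop the old C entry entirely
    else
        return alpha * acc + beta * cij  # otherwise accumulate into the old value
    end
end

update_entry(NaN, 2.0, 1.0, 0.0)  # gives 2.0; a naive alpha*acc + beta*cij would give NaN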

@maleadt
Member

maleadt commented Jun 1, 2023

The remaining failure is because e.g. mul!(y::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, A::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, x::CuSparseVector{Float32, Int32}, α::Bool, β::Bool) currently dispatches to mul! overloads in SparseArrays.jl. I guess we won't be able to do this part until SparseArrays is updated?

@dkarrasch
Contributor Author

dkarrasch commented Jun 1, 2023

I merged the SparseArrays.jl PR. That affects only master/nightly/v1.10 though, so we may need to keep some mul! methods for older versions?

Thanks for fixing the typos!

@maleadt
Member

maleadt commented Jun 1, 2023

That affects only master/nightly/v1.10 though, so we may need to keep some mul! methods for older versions?

Yes, definitely. Luckily 1.10 is slated to be the new LTS so we should be able to get rid of that code in half a year or so, but for now we need to keep the functionality.
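A hedged sketch of how that compatibility gate could look; the version bound and the commented-out method are illustrative, not the actual code:

# Keep the explicit mul! overloads on Julia versions whose SparseArrays lacks the updated
# dispatch, and rely on the leaner generic hooks on newer releases.
@static if VERSION < v"1.10.0-"
    # legacy path: define the old mul! methods here, e.g.
    # LinearAlgebra.mul!(y::CuVector, A::CuMatrix, x::CuSparseVector, alpha::Number, beta::Number) = ...
else
    # newer SparseArrays forwards to the generic hooks, so nothing extra is needed here
end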

@dkarrasch
Contributor Author

I suggest removing the "double wrappings" and simplifying the method signatures to only one wrapper. Should I prepare and push, or are you working on it right now?

@maleadt
Member

maleadt commented Jun 1, 2023

I suggest removing the "double wrappings" and simplifying the method signatures to only one wrapper.

If you have an idea how to simplify, then please! I'm working on something else right now, so feel free to push here, but I'd understand if you're fed up by now 🙂

@dkarrasch
Contributor Author

but I'd understand if you're fed up by now 🙂

Not yet 😜

@maleadt
Member

maleadt commented Jun 2, 2023

Great!

@maleadt merged commit 47ec76b into JuliaGPU:master on Jun 2, 2023
@KristofferC
Contributor

Epic!
