Reduce load time by shifting `mul!` definition #1904
Conversation
While I'm fixing my own typos, let me ask @amontoison: in #1632 you introduced …
Hi @dkarrasch, NVIDIA didn't develop routines for symmetric sparse matrices, and in the CUDA documentation they explain that we must always store both triangles. The wrapper …
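As a hedged illustration of that storage requirement (plain CPU SparseArrays code, not from this PR): materializing a `Symmetric` wrapper stores both triangles explicitly, which is the layout the cuSPARSE routines expect.

```julia
using LinearAlgebra, SparseArrays

A = sprand(4, 4, 0.5)
S = Symmetric(A)   # lazy wrapper; only the upper triangle is semantically read
M = sparse(S)      # materialized: both triangles stored explicitly
issymmetric(M)     # true
```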
@dkarrasch I'm wondering why in Julia, …
```julia
@eval begin
    function LinearAlgebra.mul!(C::CuVector{T}, A::$TypeA, B::CuSparseVector{T},
                                alpha::Number, beta::Number) where {T <: BlasFloat}
        gemvi!($transa(T), alpha, $(untaga(unwrapa(:A))), B, beta, C, 'O')
    end
end
```
@amontoison What does `untaga(unwrapa(:A))` do when `A` is `Adjoint{Float32, Hermitian{...}}`?
It should return `parent(parent(A))` and remove all the lazy wrappers.
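A minimal illustration of that unwrapping (the `unwrap` helper here is a stand-in, not the actual `untaga`/`unwrapa`):

```julia
using LinearAlgebra

# Peel lazy wrappers with `parent` until the raw storage array is reached.
unwrap(A::Union{Adjoint,Transpose,Symmetric,Hermitian}) = unwrap(parent(A))
unwrap(A::AbstractMatrix) = A

H = Hermitian(rand(Float32, 3, 3))
A = Adjoint(H)                    # build the double wrapper explicitly
unwrap(A) === parent(parent(A))   # true: both wrappers removed
```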
Ok, but then I don't understand how you remember that the `A` factor was Hermitian. `transa(T)` just stores the adjoint.
We don't need to remember that `A` is Hermitian because we don't have CUDA routines specialized for them. We just want to have a method for when `A` is wrapped in `Symmetric`/`Hermitian` wrappers.
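A hedged sketch of what those wrapper methods amount to (assumed method bodies and type constraints, not the exact PR code): each `Symmetric`/`Hermitian` method strips the wrapper and calls the same `gemvi!` routine, relying on the underlying buffer storing the full matrix.

```julia
using LinearAlgebra
using LinearAlgebra: BlasFloat
using CUDA, CUDA.CUSPARSE

for wrapper in (:Symmetric, :Hermitian)
    @eval function LinearAlgebra.mul!(C::CuVector{T}, A::$wrapper{T,<:CuMatrix{T}},
                                      B::CuSparseVector{T}, alpha::Number,
                                      beta::Number) where {T <: BlasFloat}
        # Strip the wrapper and call the plain routine with 'N' (no transpose);
        # the dense buffer is assumed to hold the full matrix.
        CUSPARSE.gemvi!('N', alpha, parent(A), B, beta, C, 'O')
    end
end
```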
Interesting! I'm not familiar with LinearAlgebra dispatch, so I really appreciate you taking a look at this. At some point in the past, CUBLAS.jl used to implement …
Did a quick profile, and this does seem to cut the load time of the CUDA.jl pkgimage in half (1.36s -> ~600ms); nice! Loading all of CUDA.jl improves from 2.5s to 1.8s. Here are the profiles:
There's still a bunch of …
I tried to change the CUSPARSE code, but got confused by the "double wrappers" and tests were failing. I'll try that again with a fresh mind, perhaps in a new PR. If …
Hm, I have no idea how tests like this can fail due to this PR:

```julia
@testset "hemm!" begin
    # compute
    C = alpha*(hA*B) + beta*C
    CUBLAS.hemm!('L','L',alpha,dhA,d_B,beta,d_C)
    # move to host and compare
    h_C = Array(d_C)
    @test C ≈ h_C
end
```

The line …
The problem is with the inputs, i.e., … EDIT: ah, the problem is on line 660: that …
Ah, okay, cool, thanks for spotting this. This should get fixed by the GPUArrays.jl PR.
Let's rerun CI and see how it goes.
First this needs a bump of the Manifest. I'll push that shortly.
Interesting failure. Looks like a bug in GPUArrays' generic matmatmul, where NaN inputs get preserved even if …
The remaining failure is because e.g. …
I merged the SparseArrays.jl PR. That affects only master/nightly/v1.10 though, so we may need to keep some …

Thanks for fixing the typos!
Yes, definitely. Luckily 1.10 is slated to be the new LTS, so we should be able to get rid of that code in half a year or so, but for now we need to keep the functionality.
I suggest removing the "double wrappings" and simplifying the method signature to only one wrapper. Should I prepare and push, or are you working on it right now?
If you have an idea how to simplify, then please! I'm working on something else right now, so feel free to push here, but I'd understand if you're fed up by now 🙂
Not yet 😜
Great!
Epic!
This is an attempt to reduce load time. It does that by not hooking into `mul!`, but by letting `LinearAlgebra.jl` take the first step and forward to `generic_matmatmul!`, and that's where we hook in. The logic is very similar to that applied in `gemm_dispatch!`, which I manually "inlined" (i.e., copy-pasted) here. I don't know if external packages rely on `gemm_dispatch!`, or if that's something that could be removed. @KristofferC pointed me to these methods. The same could be done for matvec multiplication, if that turns out to be a significant contributor to load time.
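A minimal sketch of that hooking strategy (assuming the pre-1.10 `generic_matmatmul!` signature that takes a `MulAddMul`, and covering only the BLAS-compatible case; the real code also needs generic fallbacks):

```julia
using LinearAlgebra
using LinearAlgebra: MulAddMul, BlasFloat
using CUDA
using CUDA: CUBLAS

# Let LinearAlgebra's own `mul!` take the first step; it bottoms out in
# `generic_matmatmul!`, which we overload once for GPU arrays instead of
# defining a `mul!` method per wrapper combination.
function LinearAlgebra.generic_matmatmul!(C::CuMatrix{T}, tA::AbstractChar,
                                          tB::AbstractChar, A::CuMatrix{T},
                                          B::CuMatrix{T},
                                          _add::MulAddMul) where {T <: BlasFloat}
    # gemm! computes C = alpha*op(A)*op(B) + beta*C, with op determined by
    # the 'N'/'T'/'C' flags that mul! passes down for the lazy wrappers.
    CUBLAS.gemm!(tA, tB, _add.alpha, A, B, _add.beta, C)
end
```

The matvec analogue would overload `generic_matvecmul!` in the same way.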