
RFC: Use non-blocking device side pointer mode in CUBLAS, with fallbacks #2616

Open · kshyatt wants to merge 4 commits into base: master from ksh/device_side
Conversation

@kshyatt kshyatt commented Jan 10, 2025

Attempting to address #2571

I've set the pointer mode to "device side" during handle creation. Since gemmGroupedBatched doesn't support the device-side pointer mode, it won't be usable with such a handle. One workaround would be to add a new function that creates a handle in host-side mode, or to accept the pointer mode as an optional kwarg to handle(). Very open to feedback on this.
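
For context, a minimal sketch of what switching the pointer mode involves at the API level. cublasSetPointerMode_v2 and the CUBLAS_POINTER_MODE_* enums are the cuBLAS C names, which CUDA.jl wraps under the same identifiers; treat the exact call below as an assumption about where this PR hooks in, not a quote of it:

using CUDA, CUDA.CUBLAS

h = CUBLAS.handle()
# Device-side mode: alpha/beta and scalar results live in GPU memory,
# so calls need not synchronize to read or write them.
CUBLAS.cublasSetPointerMode_v2(h, CUBLAS.CUBLAS_POINTER_MODE_DEVICE)
# Host-side mode (the previous default) would be the fallback for
# functions like gemmGroupedBatched that don't support device pointers.
CUBLAS.cublasSetPointerMode_v2(h, CUBLAS.CUBLAS_POINTER_MODE_HOST)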

I've set this up so that users can supply CuRefs of the appropriate result type to the level 1 functions. If no CuRef is provided, the functions execute as they do today (synchronously). Similarly, for functions taking alpha or beta scalar arguments: if the user provides a CuRef (actually a CuRefArray), the functions execute asynchronously and return immediately; if the user provides a Number, the behaviour is unchanged from today. I'm not married to this design and it can certainly be changed.
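
To make the two call styles concrete, a hedged usage sketch (method signatures follow this PR's wrappers; nrm2 with a result argument also appears in the benchmark further down):

using CUDA, CUDA.CUBLAS

x = CUDA.rand(Float32, 1024)

# Asynchronous path: the result lands in device memory and the call
# returns immediately, without synchronizing the stream.
res = CuRef{Float32}(0.0f0)
CUBLAS.nrm2(length(x), x, res)

# Synchronous path: unchanged from today, returns the scalar on the host.
r = CUBLAS.nrm2(length(x), x)

# Same idea for scalar inputs: a CuRef alpha stays asynchronous,
# a plain Number keeps the current synchronous behaviour.
CUBLAS.scal!(length(x), CuRef{Float32}(2.0f0), x)
CUBLAS.scal!(length(x), 2.0f0, x)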

cc @Jutho

@kshyatt kshyatt requested a review from maleadt January 10, 2025 21:03
@kshyatt kshyatt added the "cuda libraries" label (Stuff about CUDA library wrappers) Jan 10, 2025
kshyatt commented Jan 10, 2025

I can also add some more @eval blocks to try to cut down on the repetitive fallback logic.
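
Roughly what that could look like — a hedged sketch in the style of lib/cublas/wrappers.jl, not the PR's actual code (StridedCuVecOrDenseMat is the union alias that file already uses; the real loop would cover the full list of wrappers and element types):

for (fname, elty) in ((:scal!, :Float64), (:scal!, :Float32))
    @eval begin
        # Number fallback: wrap the scalar in a device ref, forward to the
        # CuRefArray method, and synchronize to keep today's semantics.
        function $fname(n::Integer, alpha::Number, x::StridedCuVecOrDenseMat{$elty})
            gpu_α = CuRef{$elty}(alpha)
            x = $fname(n, gpu_α, x)
            synchronize()
            return x
        end
    end
end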

kshyatt commented Jan 10, 2025

Sample speedup:

julia> using CUDA, CUDA.CUBLAS, LinearAlgebra;

julia> n = Int(2^26);

julia> X = CUDA.rand(Float64, n);

julia> res = CuRef{Float64}(0.0);

# do some precompilation runs first

julia> @time CUBLAS.nrm2(n, X, res);
  0.000104 seconds (18 allocations: 288 bytes)

julia> @time CUBLAS.nrm2(n, X);
  0.001564 seconds (73 allocations: 3.094 KiB)

github-actions bot left a comment
CUDA.jl Benchmarks

Benchmark suite Current: fd59678 Previous: 3d45d85 Ratio
latency/precompile 45680359374.5 ns 45362897043 ns 1.01
latency/ttfp 6379513705 ns 6376155312.5 ns 1.00
latency/import 3037486742 ns 3036001837 ns 1.00
integration/volumerhs 9568366 ns 9568516 ns 1.00
integration/byval/slices=1 146854 ns 146875.5 ns 1.00
integration/byval/slices=3 425195 ns 425040 ns 1.00
integration/byval/reference 144922 ns 144927 ns 1.00
integration/byval/slices=2 286034 ns 286033 ns 1.00
integration/cudadevrt 103411 ns 103435 ns 1.00
kernel/indexing 13985.5 ns 14009 ns 1.00
kernel/indexing_checked 14539 ns 14794 ns 0.98
kernel/occupancy 683.1315789473684 ns 698.5298013245033 ns 0.98
kernel/launch 2104.7 ns 2154 ns 0.98
kernel/rand 15439 ns 18303 ns 0.84
array/reverse/1d 19800 ns 19605 ns 1.01
array/reverse/2d 25299 ns 24620 ns 1.03
array/reverse/1d_inplace 10757.333333333334 ns 10792.666666666666 ns 1.00
array/reverse/2d_inplace 11317 ns 11263 ns 1.00
array/copy 20587 ns 20439 ns 1.01
array/iteration/findall/int 156059.5 ns 155820 ns 1.00
array/iteration/findall/bool 135452 ns 134569 ns 1.01
array/iteration/findfirst/int 154119 ns 154288 ns 1.00
array/iteration/findfirst/bool 153308 ns 153959 ns 1.00
array/iteration/scalar 63684 ns 61548 ns 1.03
array/iteration/logical 205712.5 ns 203707 ns 1.01
array/iteration/findmin/1d 39055.5 ns 38870 ns 1.00
array/iteration/findmin/2d 94055 ns 94333 ns 1.00
array/reductions/reduce/1d 36892.5 ns 30423 ns 1.21
array/reductions/reduce/2d 51584 ns 51457 ns 1.00
array/reductions/mapreduce/1d 34581.5 ns 30142 ns 1.15
array/reductions/mapreduce/2d 42807.5 ns 51380 ns 0.83
array/broadcast 21314 ns 21382 ns 1.00
array/copyto!/gpu_to_gpu 11491 ns 11620 ns 0.99
array/copyto!/cpu_to_gpu 210716 ns 209662 ns 1.01
array/copyto!/gpu_to_cpu 245565 ns 242902.5 ns 1.01
array/accumulate/1d 109016 ns 109331 ns 1.00
array/accumulate/2d 80067 ns 80156 ns 1.00
array/construct 1290.05 ns 1280.3 ns 1.01
array/random/randn/Float32 44320.5 ns 49367 ns 0.90
array/random/randn!/Float32 26341 ns 26244 ns 1.00
array/random/rand!/Int64 27068 ns 27126 ns 1.00
array/random/rand!/Float32 8761.666666666666 ns 8464.333333333334 ns 1.04
array/random/rand/Int64 29899.5 ns 35460 ns 0.84
array/random/rand/Float32 13085 ns 12776 ns 1.02
array/permutedims/4d 67800 ns 67483 ns 1.00
array/permutedims/2d 57014.5 ns 57092.5 ns 1.00
array/permutedims/3d 59619 ns 59419.5 ns 1.00
array/sorting/1d 2776343.5 ns 2776311.5 ns 1.00
array/sorting/by 3367663.5 ns 3367794.5 ns 1.00
array/sorting/2d 1085332 ns 1086101 ns 1.00
cuda/synchronization/stream/auto 1028 ns 1013.0833333333334 ns 1.01
cuda/synchronization/stream/nonblocking 6637.2 ns 6507 ns 1.02
cuda/synchronization/stream/blocking 789.3711340206186 ns 807.4622641509434 ns 0.98
cuda/synchronization/context/auto 1208.4 ns 1212.8 ns 1.00
cuda/synchronization/context/nonblocking 6771.6 ns 6677.8 ns 1.01
cuda/synchronization/context/blocking 907 ns 948.4545454545455 ns 0.96

This comment was automatically generated by workflow using github-action-benchmark.

lib/cublas/wrappers.jl — review thread (outdated, resolved)
kshyatt commented Jan 11, 2025 via email

kshyatt commented Jan 11, 2025

Is the test failure something I've done? Seems GPUArrays-related.

@kshyatt kshyatt force-pushed the ksh/device_side branch 2 times, most recently from a0829fa to 5d52d10, January 16, 2025 16:05
kshyatt commented Jan 16, 2025

OK, I think this is ready for review!

Jutho commented Jan 16, 2025

I am not qualified to review, but certainly interested in the outcome. Will the non-blocking methods only accept CuRef objects for the scalar input or output quantities, or also zero-dimensional arrays (i.e. CuArray{T,0})?

kshyatt commented Jan 16, 2025 via email

kshyatt commented Jan 16, 2025

You can create a CuRefArray{T}, where T is some element type, from a single-element CuVector. In fact, CuRef itself does this under the hood.
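
For example (a sketch; the two-argument (array, index) constructor is inferred from the CuRefArray show output quoted later in this thread):

julia> v = CUDA.zeros(Float64, 1);   # single-element CuVector

julia> ref = CUDA.CuRefArray(v, 1)   # view it as a device-side Ref
CUDA.CuRefArray{Float64, CuArray{Float64, 1, CUDA.DeviceMemory}}(Float64[0.0], 1)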

@maleadt (Member) left a comment
I wonder if we should also improve CuRef to initialize its memory by calling fill instead of memcpy: When calling memcpy, the copy likely won't be truly asynchronous (that would require pinned memory). But if we call fill, which should be possible for most scalars, the argument is passed by value and I think the call will complete asynchronously.
Something to investigate!
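
The two initialization strategies, sketched (hedged: whether each call truly runs asynchronously depends on the memory involved, per the reasoning above):

using CUDA

# memcpy path: a host-to-device copy of the scalar. Without pinned host
# memory, the runtime may block to keep the host buffer valid.
r1 = CuVector{Float32}(undef, 1)
copyto!(r1, [1.0f0])

# fill path: the value is passed by value into the launch itself, so no
# host buffer must outlive the call and it can complete asynchronously.
r2 = CuVector{Float32}(undef, 1)
fill!(r2, 1.0f0)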

Comment on lines 130 to 131
α = convert(T, alpha)
gpu_α = CuRef{T}(α)
maleadt (Member):

The converts can go; CuRef does that for you:

julia> CuRef{Float32}(1)
CUDA.CuRefArray{Float32, CuArray{Float32, 1, CUDA.DeviceMemory}}(Float32[1.0], 1)

α = convert(T, alpha)
gpu_α = CuRef{T}(α)
scal!(n, gpu_α, x)
synchronize()
maleadt (Member):

Why the synchronization? The only way to see the changes by this call is to fetch memory, which is a synchronizing operation.

kshyatt (Contributor, Author):

For scal! and other functions which don't return a scalar result, I added this to keep the previous behaviour (so that the entire call is synchronous). I'll remove the sync for things like nrm2! that return a scalar which is copied back anyway.
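
A hedged illustration of that point (assuming CuRefArray supports Ref-style scalar getindex; otherwise Array(res.x)[1] would be the equivalent fetch):

n = 2^20
X = CUDA.rand(Float64, n)
res = CuRef{Float64}(0.0)
CUBLAS.nrm2(n, X, res)   # returns immediately; the result is pending on the stream
val = res[]              # device-to-host fetch, which synchronizes first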

lib/cublas/wrappers.jl — review thread (outdated, resolved)
maleadt commented Jan 17, 2025

> Something to investigate!

#2625

github-actions[bot] — this comment was marked as off-topic.

Your PR requires formatting changes to meet the project's style guidelines.
Please consider running Runic (git runic master) to apply these changes.

Suggested changes:
diff --git a/lib/cublas/wrappers.jl b/lib/cublas/wrappers.jl
index fe6c78d7e..99a47ba9e 100644
--- a/lib/cublas/wrappers.jl
+++ b/lib/cublas/wrappers.jl
@@ -115,7 +115,7 @@ for (fname, fname_64, elty) in ((:cublasDscal_v2, :cublasDscal_v2_64, :Float64),
                                 (:cublasCscal_v2, :cublasCscal_v2_64, :ComplexF32))
     @eval begin
         function scal!(n::Integer,
-                       alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+                alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
                        x::StridedCuVecOrDenseMat{$elty})
             if CUBLAS.version() >= v"12.0"
                 $fname_64(handle(), n, alpha, x, stride(x, 1))
@@ -190,8 +190,8 @@ for (jname, fname, fname_64, elty) in ((:dot, :cublasDdot_v2, :cublasDdot_v2_64,
     @eval begin
         function $jname(n::Integer,
                         x::StridedCuVecOrDenseMat{$elty},
-                        y::StridedCuVecOrDenseMat{$elty},
-                        result::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+                y::StridedCuVecOrDenseMat{$elty},
+                result::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
             )
             if CUBLAS.version() >= v"12.0"
                 $fname_64(handle(), n, x, stride(x, 1), y, stride(y, 1), result)
@@ -339,7 +339,7 @@ for (fname, fname_64, elty) in ((:cublasDaxpy_v2, :cublasDaxpy_v2_64, :Float64),
                                 (:cublasCaxpy_v2, :cublasCaxpy_v2_64, :ComplexF32))
     @eval begin
         function axpy!(n::Integer,
-                       alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+                alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
                        dx::StridedCuVecOrDenseMat{$elty},
                        dy::StridedCuVecOrDenseMat{$elty})
             if CUBLAS.version() >= v"12.0"
@@ -472,9 +472,9 @@ for (fname, fname_64, elty) in ((:cublasIdamax_v2, :cublasIdamax_v2_64, :Float64
                                 (:cublasIcamax_v2, :cublasIcamax_v2_64, :ComplexF32))
     @eval begin
         function iamax(n::Integer,
-                       dx::StridedCuVecOrDenseMat{$elty},
-                       result::CuRefArray{Ti, CuVector{Ti, DeviceMemory}},
-                      ) where {Ti <: Integer}
+                dx::StridedCuVecOrDenseMat{$elty},
+                result::CuRefArray{Ti, CuVector{Ti, DeviceMemory}},
+            ) where {Ti <: Integer}
             if CUBLAS.version() >= v"12.0"
                 $fname_64(handle(), n, dx, stride(dx, 1), result)
             else
@@ -493,9 +493,9 @@ for (fname, fname_64, elty) in ((:cublasIdamin_v2, :cublasIdamin_v2_64, :Float64
                                 (:cublasIcamin_v2, :cublasIcamin_v2_64, :ComplexF32))
     @eval begin
         function iamin(n::Integer,
-                       dx::StridedCuVecOrDenseMat{$elty},
-                       result::CuRefArray{Ti, CuVector{Ti, DeviceMemory}},
-                      ) where {Ti <: Integer}
+                dx::StridedCuVecOrDenseMat{$elty},
+                result::CuRefArray{Ti, CuVector{Ti, DeviceMemory}},
+            ) where {Ti <: Integer}
             if CUBLAS.version() >= v"12.0"
                 $fname_64(handle(), n, dx, stride(dx, 1), result)
             else
@@ -529,10 +529,10 @@ for (fname, fname_64, elty) in ((:cublasDgemv_v2, :cublasDgemv_v2_64, :Float64),
                                 (:cublasCgemv_v2, :cublasCgemv_v2_64, :ComplexF32))
     @eval begin
         function gemv!(trans::Char,
-                       alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+                alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
                        A::StridedCuMatrix{$elty},
                        x::StridedCuVector{$elty},
-                       beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+                beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
                        y::StridedCuVector{$elty})
             # handle trans
             m,n = size(A)
@@ -561,7 +561,7 @@ end
 function gemv(trans::Char, alpha::CuRefArray{T, CuVector{T, DeviceMemory}}, A::StridedCuMatrix{T}, x::StridedCuVector{T}) where {T}
     return gemv!(trans, alpha, A, x, CuRef{T}(zero(T)), similar(x, size(A, (trans == 'N' ? 1 : 2))))
 end
-function gemv(trans::Char, alpha::Number, A::StridedCuMatrix{T}, x::StridedCuVector{T}) where T
+function gemv(trans::Char, alpha::Number, A::StridedCuMatrix{T}, x::StridedCuVector{T}) where {T}
     gemv!(trans, alpha, A, x, zero(T), similar(x, size(A, (trans == 'N' ? 1 : 2))))
 end
 # should this be async?
@@ -579,12 +579,12 @@ for (fname, fname_64, eltyin, eltyout, eltyconst) in (
     )
     @eval begin
         function gemv_batched!(trans::Char,
-                               alpha::CuRefArray{$eltyconst, CuVector{$eltyconst, DeviceMemory}},
-                               A::Vector{<:StridedCuMatrix{$eltyin}},
-                               x::Vector{<:StridedCuVector{$eltyin}},
-                               beta::CuRefArray{$eltyconst, CuVector{$eltyconst, DeviceMemory}},
-                               y::Vector{<:StridedCuVector{$eltyout}}
-                              )
+                alpha::CuRefArray{$eltyconst, CuVector{$eltyconst, DeviceMemory}},
+                A::Vector{<:StridedCuMatrix{$eltyin}},
+                x::Vector{<:StridedCuVector{$eltyin}},
+                beta::CuRefArray{$eltyconst, CuVector{$eltyconst, DeviceMemory}},
+                y::Vector{<:StridedCuVector{$eltyout}}
+            )
             if length(A) != length(x) || length(A) != length(y)
                 throw(DimensionMismatch("Lengths of inputs must be the same"))
             end
@@ -615,13 +615,13 @@ for (fname, fname_64, eltyin, eltyout, eltyconst) in (
             y
         end
         function gemv_batched!(
-                               trans::Char,
-                               alpha::Number,
-                               A::Vector{<:StridedCuMatrix{$eltyin}},
-                               x::Vector{<:StridedCuVector{$eltyin}},
-                               beta::Number,
-                               y::Vector{<:StridedCuVector{$eltyout}}
-                              )
+                trans::Char,
+                alpha::Number,
+                A::Vector{<:StridedCuMatrix{$eltyin}},
+                x::Vector{<:StridedCuVector{$eltyin}},
+                beta::Number,
+                y::Vector{<:StridedCuVector{$eltyout}}
+            )
             gpu_α = CuRef{$eltyconst}(alpha)
             gpu_β = CuRef{$eltyconst}(beta)
             y = gemv_batched!(trans, gpu_α, A, x, gpu_β, y)
@@ -641,12 +641,12 @@ for (fname, fname_64, eltyin, eltyout, eltyconst) in (
     )
     @eval begin
         function gemv_strided_batched!(trans::Char,
-                                       alpha::CuRefArray{$eltyconst, CuVector{$eltyconst, DeviceMemory}},
-                                       A::AbstractArray{$eltyin, 3},
-                                       x::AbstractArray{$eltyin, 2},
-                                       beta::CuRefArray{$eltyconst, CuVector{$eltyconst, DeviceMemory}},
-                                       y::AbstractArray{$eltyout, 2}
-                                      )
+                alpha::CuRefArray{$eltyconst, CuVector{$eltyconst, DeviceMemory}},
+                A::AbstractArray{$eltyin, 3},
+                x::AbstractArray{$eltyin, 2},
+                beta::CuRefArray{$eltyconst, CuVector{$eltyconst, DeviceMemory}},
+                y::AbstractArray{$eltyout, 2}
+            )
             if size(A, 3) != size(x, 2) || size(A, 3) != size(y, 2)
                 throw(DimensionMismatch("Batch sizes must be equal for all inputs"))
             end
@@ -671,13 +671,13 @@ for (fname, fname_64, eltyin, eltyout, eltyconst) in (
             y
         end
         function gemv_strided_batched!(
-                                       trans::Char,
-                                       alpha::Number,
-                                       A::AbstractArray{$eltyin, 3},
-                                       x::AbstractArray{$eltyin, 2},
-                                       beta::Number,
-                                       y::AbstractArray{$eltyout, 2}
-                                      )
+                trans::Char,
+                alpha::Number,
+                A::AbstractArray{$eltyin, 3},
+                x::AbstractArray{$eltyin, 2},
+                beta::Number,
+                y::AbstractArray{$eltyout, 2}
+            )
             gpu_α = CuRef{$eltyconst}(alpha)
             gpu_β = CuRef{$eltyconst}(beta)
             y = gemv_strided_batched!(trans, gpu_α, A, x, gpu_β, y)
@@ -697,10 +697,10 @@ for (fname, fname_64, elty) in ((:cublasDgbmv_v2, :cublasDgbmv_v2_64, :Float64),
                        m::Integer,
                        kl::Integer,
                        ku::Integer,
-                       alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+                alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
                        A::StridedCuMatrix{$elty},
                        x::StridedCuVector{$elty},
-                       beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+                beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
                        y::StridedCuVector{$elty})
             n = size(A,2)
             # check dimensions
@@ -716,16 +716,17 @@ for (fname, fname_64, elty) in ((:cublasDgbmv_v2, :cublasDgbmv_v2_64, :Float64),
             end
             y
         end
-        function gbmv!(trans::Char,
-                       m::Integer,
-                       kl::Integer,
-                       ku::Integer,
-                       alpha::Number,
-                       A::StridedCuMatrix{$elty},
-                       x::StridedCuVector{$elty},
-                       beta::Number,
-                       y::StridedCuVector{$elty}
-                      )
+        function gbmv!(
+                trans::Char,
+                m::Integer,
+                kl::Integer,
+                ku::Integer,
+                alpha::Number,
+                A::StridedCuMatrix{$elty},
+                x::StridedCuVector{$elty},
+                beta::Number,
+                y::StridedCuVector{$elty}
+            )
 
             gpu_α = CuRef{$elty}(alpha)
             gpu_β = CuRef{$elty}(beta)
@@ -735,8 +736,10 @@ for (fname, fname_64, elty) in ((:cublasDgbmv_v2, :cublasDgbmv_v2_64, :Float64),
         end
     end
 end
-function gbmv(trans::Char, m::Integer, kl::Integer, ku::Integer, alpha::CuVector{T},
-              A::StridedCuMatrix{T}, x::StridedCuVector{T}) where {T}
+function gbmv(
+        trans::Char, m::Integer, kl::Integer, ku::Integer, alpha::CuVector{T},
+        A::StridedCuMatrix{T}, x::StridedCuVector{T}
+    ) where {T}
     # TODO: fix gbmv bug in julia
     n = size(A, 2)
     leny = trans == 'N' ? m : n
@@ -759,10 +762,10 @@ for (fname, fname_64, elty) in ((:cublasDspmv_v2, :cublasDspmv_v2_64, :Float64),
                                 (:cublasSspmv_v2, :cublasSspmv_v2_64, :Float32))
     @eval begin
         function spmv!(uplo::Char,
-                       alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+                alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
                        AP::StridedCuVector{$elty},
                        x::StridedCuVector{$elty},
-                       beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+                beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
                        y::StridedCuVector{$elty})
             n = round(Int, (sqrt(8*length(AP))-1)/2)
             if n != length(x) || n != length(y) throw(DimensionMismatch("")) end
@@ -777,21 +780,24 @@ for (fname, fname_64, elty) in ((:cublasDspmv_v2, :cublasDspmv_v2_64, :Float64),
         end
     end
 end
-function spmv!(uplo::Char,
-               alpha::Number,
-               AP::StridedCuVector{T},
-               x::StridedCuVector{T},
-               beta::Number,
-               y::StridedCuVector{T}
-              ) where {T}
+function spmv!(
+        uplo::Char,
+        alpha::Number,
+        AP::StridedCuVector{T},
+        x::StridedCuVector{T},
+        beta::Number,
+        y::StridedCuVector{T}
+    ) where {T}
     gpu_α = CuRef{T}(alpha)
     gpu_β = CuRef{T}(beta)
     y = spmv!(uplo, gpu_α, AP, x, gpu_β, y)
     synchronize()
     return y
 end
-function spmv(uplo::Char, alpha::CuVector{T},
-              AP::StridedCuVector{T}, x::StridedCuVector{T}) where {T}
+function spmv(
+        uplo::Char, alpha::CuVector{T},
+        AP::StridedCuVector{T}, x::StridedCuVector{T}
+    ) where {T}
     return spmv!(uplo, alpha, AP, x, CuRef{T}(zero(T)), similar(x))
 end
 function spmv(uplo::Char, alpha::Number,
@@ -810,10 +816,10 @@ for (fname, fname_64, elty) in ((:cublasDsymv_v2, :cublasDsymv_v2_64, :Float64),
     # Note that the complex symv are not BLAS but auiliary functions in LAPACK
     @eval begin
         function symv!(uplo::Char,
-                       alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+                alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
                        A::StridedCuMatrix{$elty},
                        x::StridedCuVector{$elty},
-                       beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+                beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
                        y::StridedCuVector{$elty})
             m, n = size(A)
             if m != n throw(DimensionMismatch("Matrix A is $m by $n but must be square")) end
@@ -864,10 +870,10 @@ for (fname, fname_64, elty) in ((:cublasZhemv_v2, :cublasZhemv_v2_64, :ComplexF6
                                 (:cublasChemv_v2, :cublasChemv_v2_64, :ComplexF32))
     @eval begin
         function hemv!(uplo::Char,
-                       alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+                alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
                        A::StridedCuMatrix{$elty},
                        x::StridedCuVector{$elty},
-                       beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+                beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
                        y::StridedCuVector{$elty})
             # TODO: fix dimension check bug in julia
             m, n = size(A)
@@ -922,10 +928,10 @@ for (fname, fname_64, elty) in ((:cublasDsbmv_v2, :cublasDsbmv_v2_64, :Float64),
     @eval begin
         function sbmv!(uplo::Char,
                        k::Integer,
-                       alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+                alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
                        A::StridedCuMatrix{$elty},
                        x::StridedCuVector{$elty},
-                       beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+                beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
                        y::StridedCuVector{$elty})
             m, n = size(A)
             #if m != n throw(DimensionMismatch("Matrix A is $m by $n but must be square")) end
@@ -981,10 +987,10 @@ for (fname, fname_64, elty) in ((:cublasZhbmv_v2, :cublasZhbmv_v2_64, :ComplexF6
     @eval begin
         function hbmv!(uplo::Char,
                        k::Integer,
-                       alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+                alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
                        A::StridedCuMatrix{$elty},
                        x::StridedCuVector{$elty},
-                       beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+                beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
                        y::StridedCuVector{$elty})
             m, n = size(A)
             if !(1<=(1+k)<=n) throw(DimensionMismatch("Incorrect number of bands")) end
@@ -1168,7 +1174,7 @@ for (fname, fname_64, elty) in ((:cublasDger_v2, :cublasDger_v2_64, :Float64),
                                 (:cublasCgerc_v2, :cublasCgerc_v2_64, :ComplexF32))
     @eval begin
         function ger!(
-                      alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+                alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
                       x::StridedCuVector{$elty},
                       y::StridedCuVector{$elty},
                       A::StridedCuMatrix{$elty})
@@ -1204,7 +1210,7 @@ for (fname, fname_64, elty) in ((:cublasDspr_v2, :cublasDspr_v2_64, :Float64),
                                 (:cublasSspr_v2, :cublasSspr_v2_64, :Float32))
     @eval begin
         function spr!(uplo::Char,
-                      alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+                alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
                       x::StridedCuVector{$elty},
                       AP::StridedCuVector{$elty})
             n = round(Int, (sqrt(8*length(AP))-1)/2)
@@ -1238,7 +1244,7 @@ for (fname, fname_64, elty) in ((:cublasDsyr_v2, :cublasDsyr_v2_64, :Float64),
                                 (:cublasCsyr_v2, :cublasCsyr_v2_64, :ComplexF32))
     @eval begin
         function syr!(uplo::Char,
-                      alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+                alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
                       x::StridedCuVector{$elty},
                       A::StridedCuMatrix{$elty})
             m, n = size(A)
@@ -1274,7 +1280,7 @@ for (fname, fname_64, elty, relty) in (
     )
     @eval begin
         function her!(uplo::Char,
-                      alpha::CuRefArray{$relty, CuVector{$relty, DeviceMemory}},
+                alpha::CuRefArray{$relty, CuVector{$relty, DeviceMemory}},
                       x::StridedCuVector{$elty},
                       A::StridedCuMatrix{$elty})
             m, n = size(A)
@@ -1308,11 +1314,11 @@ for (fname, fname_64, elty) in ((:cublasZher2_v2, :cublasZher2_v2_64, :ComplexF6
                                 (:cublasCher2_v2, :cublasCher2_v2_64, :ComplexF32))
     @eval begin
         function her2!(uplo::Char,
-                       alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
-                       x::StridedCuVector{$elty},
-                       y::StridedCuVector{$elty},
-                       A::StridedCuMatrix{$elty}
-                      )
+                alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+                x::StridedCuVector{$elty},
+                y::StridedCuVector{$elty},
+                A::StridedCuMatrix{$elty}
+            )
             m, n = size(A)
             m == n || throw(DimensionMismatch("Matrix A is $m by $n but must be square"))
             length(x) == n || throw(DimensionMismatch("Length of vector must be the same as the matrix dimensions"))
@@ -1352,10 +1358,10 @@ for (fname, fname_64, elty) in ((:cublasDgemm_v2, :cublasDgemm_v2_64, :Float64),
     @eval begin
         function gemm!(transA::Char,
                        transB::Char,
-                       alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+                alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
                        A::StridedCuVecOrMat{$elty},
                        B::StridedCuVecOrMat{$elty},
-                       beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+                beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
                        C::StridedCuVecOrMat{$elty})
             m = size(A, transA == 'N' ? 1 : 2)
             k = size(A, transA == 'N' ? 2 : 1)
@@ -1493,10 +1499,10 @@ function gemmExComputeType(TA, TB, TC, m, k, n)
 end
 
 function gemmEx!(transA::Char, transB::Char,
-                 @nospecialize(alpha::CuRefArray),
+        @nospecialize(alpha::CuRefArray),
                  @nospecialize(A::StridedCuVecOrMat),
                  @nospecialize(B::StridedCuVecOrMat),
-                 @nospecialize(beta::CuRefArray),
+        @nospecialize(beta::CuRefArray),
                  @nospecialize(C::StridedCuVecOrMat);
                  algo::cublasGemmAlgo_t=CUBLAS_GEMM_DEFAULT)
     m = size(A, transA == 'N' ? 1 : 2)
@@ -1551,10 +1557,10 @@ end
 
 # TODO for device mode pointers
 function gemmBatchedEx!(transA::Char, transB::Char,
-                 @nospecialize(alpha::CuRefArray),
+        @nospecialize(alpha::CuRefArray),
                  @nospecialize(A::Vector{<:StridedCuVecOrMat}),
                  @nospecialize(B::Vector{<:StridedCuVecOrMat}),
-                 @nospecialize(beta::CuRefArray),
+        @nospecialize(beta::CuRefArray),
                  @nospecialize(C::Vector{<:StridedCuVecOrMat});
                  algo::cublasGemmAlgo_t=CUBLAS_GEMM_DEFAULT)
     if length(A) != length(B) || length(A) != length(C)
@@ -1622,11 +1628,11 @@ function gemmBatchedEx!(
 end
 
 function gemmStridedBatchedEx!(
-                 transA::Char, transB::Char,
-                 @nospecialize(alpha::CuRefArray),
+        transA::Char, transB::Char,
+        @nospecialize(alpha::CuRefArray),
                  @nospecialize(A::AbstractArray{Ta, 3}),
                  @nospecialize(B::AbstractArray{Tb, 3}),
-                 @nospecialize(beta::CuRefArray),
+        @nospecialize(beta::CuRefArray),
                  @nospecialize(C::AbstractArray{Tc, 3});
                  algo::cublasGemmAlgo_t=CUBLAS_GEMM_DEFAULT) where {Ta, Tb, Tc}
     if size(A, 3) != size(B, 3) || size(A, 3) != size(C, 3)
@@ -1865,10 +1871,10 @@ for (fname, fname_64, elty) in ((:cublasDgemmBatched, :cublasDgemmBatched_64, :F
     @eval begin
         function gemm_batched!(transA::Char,
                                transB::Char,
-                               alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+                alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
                                A::Vector{<:StridedCuMatrix{$elty}},
                                B::Vector{<:StridedCuMatrix{$elty}},
-                               beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+                beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
                                C::Vector{<:StridedCuMatrix{$elty}})
             if length(A) != length(B) || length(A) != length(C)
                 throw(DimensionMismatch(""))
@@ -1948,10 +1954,10 @@ for (fname, fname_64, elty) in ((:cublasDgemmStridedBatched, :cublasDgemmStrided
     @eval begin
         function gemm_strided_batched!(transA::Char,
                                transB::Char,
-                               alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+                alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
                                A::AbstractArray{$elty, 3}, # allow PermutedDimsArray
                                B::AbstractArray{$elty, 3},
-                               beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+                beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
                                C::AbstractArray{$elty, 3})
            m = size(A, transA == 'N' ? 1 : 2)
            k = size(A, transA == 'N' ? 2 : 1)
@@ -2031,10 +2037,10 @@ for (fname, fname_64, elty) in ((:cublasDsymm_v2, :cublasDsymm_v2_64, :Float64),
     @eval begin
         function symm!(side::Char,
                        uplo::Char,
-                       alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+                alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
                        A::StridedCuMatrix{$elty},
                        B::StridedCuMatrix{$elty},
-                       beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+                beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
                        C::StridedCuMatrix{$elty})
             k, nA = size(A)
             if k != nA throw(DimensionMismatch("Matrix A must be square")) end
@@ -2093,9 +2099,9 @@ for (fname, fname_64, elty) in ((:cublasDsyrk_v2, :cublasDsyrk_v2_64, :Float64),
     @eval begin
         function syrk!(uplo::Char,
                        trans::Char,
-                       alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+                alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
                        A::StridedCuVecOrMat{$elty},
-                       beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+                beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
                        C::StridedCuMatrix{$elty})
             mC, n = size(C)
             if mC != n throw(DimensionMismatch("C must be square")) end
@@ -2146,10 +2152,10 @@ for (fname, fname_64, elty) in ((:cublasDsyrkx, :cublasDsyrkx_64, :Float64),
     @eval begin
         function syrkx!(uplo::Char,
                        trans::Char,
-                       alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+                alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
                        A::StridedCuVecOrMat{$elty},
                        B::StridedCuVecOrMat{$elty},
-                       beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+                beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
                        C::StridedCuMatrix{$elty})
             mC, n = size(C)
             if mC != n throw(DimensionMismatch("C must be square")) end
@@ -2205,10 +2211,10 @@ for (fname, fname_64, elty) in ((:cublasZhemm_v2, :cublasZhemm_v2_64, :ComplexF6
     @eval begin
         function hemm!(side::Char,
                        uplo::Char,
-                       alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+                alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
                        A::StridedCuMatrix{$elty},
                        B::StridedCuMatrix{$elty},
-                       beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+                beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
                        C::StridedCuMatrix{$elty})
             mA, nA = size(A)
             m, n = size(B)
@@ -2268,9 +2274,9 @@ for (fname, fname_64, elty, relty) in (
     @eval begin
         function herk!(uplo::Char,
                        trans::Char,
-                       alpha::CuRefArray{$relty, CuVector{$relty, DeviceMemory}},
+                alpha::CuRefArray{$relty, CuVector{$relty, DeviceMemory}},
                        A::StridedCuVecOrMat{$elty},
-                       beta::CuRefArray{$relty, CuVector{$relty, DeviceMemory}},
+                beta::CuRefArray{$relty, CuVector{$relty, DeviceMemory}},
                        C::StridedCuMatrix{$elty})
             mC, n = size(C)
             if mC != n throw(DimensionMismatch("C must be square")) end
@@ -2327,10 +2333,10 @@ for (fname, fname_64, elty) in ((:cublasDsyr2k_v2, :cublasDsyr2k_v2_64, :Float64
     @eval begin
         function syr2k!(uplo::Char,
                         trans::Char,
-                        alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+                alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
                         A::StridedCuVecOrMat{$elty},
                         B::StridedCuVecOrMat{$elty},
-                        beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+                beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
                         C::StridedCuMatrix{$elty})
             # TODO: check size of B in julia (syr2k!)
             m, n = size(C)
@@ -2386,7 +2392,7 @@ function syr2k(uplo::Char,
                B::StridedCuVecOrMat)
     T = eltype(A)
     n = size(A, trans == 'N' ? 1 : 2)
-    syr2k!(uplo, trans, convert(T, alpha), A, B, zero(T), similar(A, T, (n, n)))
+    return syr2k!(uplo, trans, convert(T, alpha), A, B, zero(T), similar(A, T, (n, n)))
 end
 function syr2k(uplo::Char, trans::Char, A::StridedCuVecOrMat, B::StridedCuVecOrMat)
     syr2k(uplo, trans, one(eltype(A)), A, B)
@@ -2400,10 +2406,10 @@ for (fname, fname_64, elty, relty) in (
     @eval begin
         function her2k!(uplo::Char,
                         trans::Char,
-                        alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+                alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
                         A::StridedCuVecOrMat{$elty},
                         B::StridedCuVecOrMat{$elty},
-                        beta::CuRefArray{$relty, CuVector{$relty, DeviceMemory}},
+                beta::CuRefArray{$relty, CuVector{$relty, DeviceMemory}},
                         C::StridedCuMatrix{$elty})
             # TODO: check size of B in julia (her2k!)
             m, n = size(C)
@@ -2477,7 +2483,7 @@ for (mmname, smname, elty) in
                        uplo::Char,
                        transa::Char,
                        diag::Char,
-                       alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+                alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
                        A::StridedCuMatrix{$elty},
                        B::StridedCuMatrix{$elty},
                        C::StridedCuMatrix{$elty})
@@ -2499,7 +2505,7 @@ for (mmname, smname, elty) in
                        uplo::Char,
                        transa::Char,
                        diag::Char,
-                       alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+                alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
                        A::StridedCuMatrix{$elty},
                        B::StridedCuMatrix{$elty})
             m, n = size(B)
@@ -2564,7 +2570,7 @@ for (fname, fname_64, elty) in ((:cublasDtrsmBatched, :cublasDtrsmBatched_64, :F
                                uplo::Char,
                                transa::Char,
                                diag::Char,
-                               alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+                alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
                                A::Vector{<:StridedCuMatrix{$elty}},
                                B::Vector{<:StridedCuMatrix{$elty}})
             if length(A) != length(B)
@@ -2620,9 +2626,9 @@ for (fname, fname_64, elty) in ((:cublasDgeam, :cublasDgeam_64, :Float64),
     @eval begin
         function geam!(transa::Char,
                        transb::Char,
-                       alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+                alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
                        A::StridedCuMatrix{$elty},
-                       beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+                beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
                        B::StridedCuMatrix{$elty},
                        C::StridedCuMatrix{$elty})
             mA, nA = size(A)
@@ -2860,8 +2866,9 @@ for (fname, elty) in ((:cublasDgetriBatched, :Float64),
         end
 
         function getri_batched!(n, Aptrs::CuVector{CuPtr{$elty}},
-                                lda, Cptrs::CuVector{CuPtr{$elty}},ldc,
-                                pivotArray::CuArray{Cint})
+                lda, Cptrs::CuVector{CuPtr{$elty}}, ldc,
+                pivotArray::CuArray{Cint}
+            )
             batchSize = length(Aptrs)
             info = CuArray{Cint}(undef, batchSize)
             $fname(handle(), n, Aptrs, lda, pivotArray, Cptrs, ldc, info, batchSize)

maleadt commented Jan 20, 2025

CI failures seem relevant.

Feel free to ignore the formatter; I made it less spammy 😉
