RFC: Use non-blocking device side pointer mode in CUBLAS, with fallbacks #2616
base: master
Conversation
I can also add some more
Sample speedup:

julia> using CUDA, CUDA.CUBLAS, LinearAlgebra;

julia> n = Int(2^26);

julia> X = CUDA.rand(Float64, n);

julia> res = CuRef{Float64}(0.0);

# do some precompilation runs first

julia> @time CUBLAS.nrm2(n, X, res);
  0.000104 seconds (18 allocations: 288 bytes)

julia> @time CUBLAS.nrm2(n, X);
  0.001564 seconds (73 allocations: 3.094 KiB)
CUDA.jl Benchmarks
| Benchmark suite | Current: fd59678 | Previous: 3d45d85 | Ratio |
|---|---|---|---|
| latency/precompile | 45680359374.5 ns | 45362897043 ns | 1.01 |
| latency/ttfp | 6379513705 ns | 6376155312.5 ns | 1.00 |
| latency/import | 3037486742 ns | 3036001837 ns | 1.00 |
| integration/volumerhs | 9568366 ns | 9568516 ns | 1.00 |
| integration/byval/slices=1 | 146854 ns | 146875.5 ns | 1.00 |
| integration/byval/slices=3 | 425195 ns | 425040 ns | 1.00 |
| integration/byval/reference | 144922 ns | 144927 ns | 1.00 |
| integration/byval/slices=2 | 286034 ns | 286033 ns | 1.00 |
| integration/cudadevrt | 103411 ns | 103435 ns | 1.00 |
| kernel/indexing | 13985.5 ns | 14009 ns | 1.00 |
| kernel/indexing_checked | 14539 ns | 14794 ns | 0.98 |
| kernel/occupancy | 683.1315789473684 ns | 698.5298013245033 ns | 0.98 |
| kernel/launch | 2104.7 ns | 2154 ns | 0.98 |
| kernel/rand | 15439 ns | 18303 ns | 0.84 |
| array/reverse/1d | 19800 ns | 19605 ns | 1.01 |
| array/reverse/2d | 25299 ns | 24620 ns | 1.03 |
| array/reverse/1d_inplace | 10757.333333333334 ns | 10792.666666666666 ns | 1.00 |
| array/reverse/2d_inplace | 11317 ns | 11263 ns | 1.00 |
| array/copy | 20587 ns | 20439 ns | 1.01 |
| array/iteration/findall/int | 156059.5 ns | 155820 ns | 1.00 |
| array/iteration/findall/bool | 135452 ns | 134569 ns | 1.01 |
| array/iteration/findfirst/int | 154119 ns | 154288 ns | 1.00 |
| array/iteration/findfirst/bool | 153308 ns | 153959 ns | 1.00 |
| array/iteration/scalar | 63684 ns | 61548 ns | 1.03 |
| array/iteration/logical | 205712.5 ns | 203707 ns | 1.01 |
| array/iteration/findmin/1d | 39055.5 ns | 38870 ns | 1.00 |
| array/iteration/findmin/2d | 94055 ns | 94333 ns | 1.00 |
| array/reductions/reduce/1d | 36892.5 ns | 30423 ns | 1.21 |
| array/reductions/reduce/2d | 51584 ns | 51457 ns | 1.00 |
| array/reductions/mapreduce/1d | 34581.5 ns | 30142 ns | 1.15 |
| array/reductions/mapreduce/2d | 42807.5 ns | 51380 ns | 0.83 |
| array/broadcast | 21314 ns | 21382 ns | 1.00 |
| array/copyto!/gpu_to_gpu | 11491 ns | 11620 ns | 0.99 |
| array/copyto!/cpu_to_gpu | 210716 ns | 209662 ns | 1.01 |
| array/copyto!/gpu_to_cpu | 245565 ns | 242902.5 ns | 1.01 |
| array/accumulate/1d | 109016 ns | 109331 ns | 1.00 |
| array/accumulate/2d | 80067 ns | 80156 ns | 1.00 |
| array/construct | 1290.05 ns | 1280.3 ns | 1.01 |
| array/random/randn/Float32 | 44320.5 ns | 49367 ns | 0.90 |
| array/random/randn!/Float32 | 26341 ns | 26244 ns | 1.00 |
| array/random/rand!/Int64 | 27068 ns | 27126 ns | 1.00 |
| array/random/rand!/Float32 | 8761.666666666666 ns | 8464.333333333334 ns | 1.04 |
| array/random/rand/Int64 | 29899.5 ns | 35460 ns | 0.84 |
| array/random/rand/Float32 | 13085 ns | 12776 ns | 1.02 |
| array/permutedims/4d | 67800 ns | 67483 ns | 1.00 |
| array/permutedims/2d | 57014.5 ns | 57092.5 ns | 1.00 |
| array/permutedims/3d | 59619 ns | 59419.5 ns | 1.00 |
| array/sorting/1d | 2776343.5 ns | 2776311.5 ns | 1.00 |
| array/sorting/by | 3367663.5 ns | 3367794.5 ns | 1.00 |
| array/sorting/2d | 1085332 ns | 1086101 ns | 1.00 |
| cuda/synchronization/stream/auto | 1028 ns | 1013.0833333333334 ns | 1.01 |
| cuda/synchronization/stream/nonblocking | 6637.2 ns | 6507 ns | 1.02 |
| cuda/synchronization/stream/blocking | 789.3711340206186 ns | 807.4622641509434 ns | 0.98 |
| cuda/synchronization/context/auto | 1208.4 ns | 1212.8 ns | 1.00 |
| cuda/synchronization/context/nonblocking | 6771.6 ns | 6677.8 ns | 1.01 |
| cuda/synchronization/context/blocking | 907 ns | 948.4545454545455 ns | 0.96 |
This comment was automatically generated by workflow using github-action-benchmark.
Yeah, should one of us open an issue?
On Sat, Jan 11, 2025 at 2:48 AM Tim Besard wrote:
In lib/cublas/wrappers.jl
<#2616 (comment)>:
> function scal!(n::Integer, alpha::Number, x::StridedCuVecOrDenseMat{Float16})
- α = convert(Float32, alpha)
- cublasScalEx(handle(), n, Ref{Float32}(α), Float32, x, Float16, stride(x, 1), Float32)
+ α = CuRef{Float32}( convert(Float32, alpha) )
We should improve CuRef so that it can be constructed identically to Ref.
Ref{T}(x) doing an implicit convert is pretty convenient.
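The Ref-like behaviour suggested above could be sketched as follows. This is a hypothetical helper, not CUDA.jl's actual implementation, and `device_scalar` is a made-up name; the point is simply that conversion happens at construction time, like `Ref{T}(x)`:

```julia
using CUDA

# Hypothetical helper mirroring Ref{T}(x): convert on construction, then
# upload a 1-element device buffer that CUBLAS can use as a device-side pointer.
function device_scalar(::Type{T}, val) where {T}
    return CuArray{T}([convert(T, val)])  # implicit convert, like Ref{T}(x)
end
```

With this, `device_scalar(Float32, 1)` would accept an `Int` and yield a 1-element `CuArray{Float32}`, so callers don't have to `convert` by hand.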
Is the test failure something I've done? Seems GPUArrays related.
Force-pushed from a0829fa to 5d52d10
OK, I think this is ready for review!
I am not qualified to review, but certainly interested in the outcome. Will the non-blocking methods only accept CuRef objects for the scalar input or output quantities, or also zero-dimensional arrays (i.e. CuArray{T,0})?
For now only CuRef but these are easy to create (it’s exported by CUDA.jl).
I think one can also create them without a copy from a regular CuArray?
On Thu, Jan 16, 2025 at 3:41 PM Jutho wrote:
I am not qualified to review, but certainly interested in the outcome.
Will the non-blocking methods only accept CuRef objects for the scalar
input or output quantities, or also zero-dimensional arrays (i.e.
CuArray{T,0})?
You can create a
I wonder if we should also improve CuRef to initialize its memory by calling fill instead of memcpy: when calling memcpy, the copy likely won't be truly asynchronous (that would require pinned memory). But if we call fill, which should be possible for most scalars, the argument is passed by value and I think the call will complete asynchronously.
Something to investigate!
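The two initialization strategies being contrasted can be sketched like this (hypothetical helper names, not CUDA.jl internals; the `copyto!` path stages a host-to-device memcpy, while the `fill!` path passes the scalar by value to a kernel launch):

```julia
using CUDA

# memcpy-style init: copies from unpinned host memory, so the transfer
# is unlikely to be truly asynchronous.
function scalar_ref_memcpy(val::T) where {T}
    buf = CuArray{T}(undef, 1)
    copyto!(buf, [val])   # host -> device memcpy
    return buf
end

# fill-style init: the value is embedded in the kernel launch arguments,
# so the call can complete asynchronously with respect to the host.
function scalar_ref_fill(val::T) where {T}
    buf = CuArray{T}(undef, 1)
    fill!(buf, val)       # value passed by value to a device kernel
    return buf
end
```

Both return a 1-element device buffer holding `val`; only the synchronization behaviour of the initialization differs.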
In lib/cublas/wrappers.jl (outdated):
α = convert(T, alpha)
gpu_α = CuRef{T}(α)
The converts can go; CuRef does that for you:
julia> CuRef{Float32}(1)
CUDA.CuRefArray{Float32, CuArray{Float32, 1, CUDA.DeviceMemory}}(Float32[1.0], 1)
α = convert(T, alpha)
gpu_α = CuRef{T}(α)
scal!(n, gpu_α, x)
synchronize()
Why the synchronization? The only way to see the changes by this call is to fetch memory, which is a synchronizing operation.
For scal! and other functions which don't return a scalar result, I added this to keep the previous behaviour (so that the entire call is synchronous). I'll remove the sync for things like nrm2! that return a scalar, which is copied back anyway.
Force-pushed from 829083e to fd59678
Your PR requires formatting changes to meet the project's style guidelines. Click here to view the suggested changes.

diff --git a/lib/cublas/wrappers.jl b/lib/cublas/wrappers.jl
index fe6c78d7e..99a47ba9e 100644
--- a/lib/cublas/wrappers.jl
+++ b/lib/cublas/wrappers.jl
@@ -115,7 +115,7 @@ for (fname, fname_64, elty) in ((:cublasDscal_v2, :cublasDscal_v2_64, :Float64),
(:cublasCscal_v2, :cublasCscal_v2_64, :ComplexF32))
@eval begin
function scal!(n::Integer,
- alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+ alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
x::StridedCuVecOrDenseMat{$elty})
if CUBLAS.version() >= v"12.0"
$fname_64(handle(), n, alpha, x, stride(x, 1))
@@ -190,8 +190,8 @@ for (jname, fname, fname_64, elty) in ((:dot, :cublasDdot_v2, :cublasDdot_v2_64,
@eval begin
function $jname(n::Integer,
x::StridedCuVecOrDenseMat{$elty},
- y::StridedCuVecOrDenseMat{$elty},
- result::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+ y::StridedCuVecOrDenseMat{$elty},
+ result::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
)
if CUBLAS.version() >= v"12.0"
$fname_64(handle(), n, x, stride(x, 1), y, stride(y, 1), result)
@@ -339,7 +339,7 @@ for (fname, fname_64, elty) in ((:cublasDaxpy_v2, :cublasDaxpy_v2_64, :Float64),
(:cublasCaxpy_v2, :cublasCaxpy_v2_64, :ComplexF32))
@eval begin
function axpy!(n::Integer,
- alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+ alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
dx::StridedCuVecOrDenseMat{$elty},
dy::StridedCuVecOrDenseMat{$elty})
if CUBLAS.version() >= v"12.0"
@@ -472,9 +472,9 @@ for (fname, fname_64, elty) in ((:cublasIdamax_v2, :cublasIdamax_v2_64, :Float64
(:cublasIcamax_v2, :cublasIcamax_v2_64, :ComplexF32))
@eval begin
function iamax(n::Integer,
- dx::StridedCuVecOrDenseMat{$elty},
- result::CuRefArray{Ti, CuVector{Ti, DeviceMemory}},
- ) where {Ti <: Integer}
+ dx::StridedCuVecOrDenseMat{$elty},
+ result::CuRefArray{Ti, CuVector{Ti, DeviceMemory}},
+ ) where {Ti <: Integer}
if CUBLAS.version() >= v"12.0"
$fname_64(handle(), n, dx, stride(dx, 1), result)
else
@@ -493,9 +493,9 @@ for (fname, fname_64, elty) in ((:cublasIdamin_v2, :cublasIdamin_v2_64, :Float64
(:cublasIcamin_v2, :cublasIcamin_v2_64, :ComplexF32))
@eval begin
function iamin(n::Integer,
- dx::StridedCuVecOrDenseMat{$elty},
- result::CuRefArray{Ti, CuVector{Ti, DeviceMemory}},
- ) where {Ti <: Integer}
+ dx::StridedCuVecOrDenseMat{$elty},
+ result::CuRefArray{Ti, CuVector{Ti, DeviceMemory}},
+ ) where {Ti <: Integer}
if CUBLAS.version() >= v"12.0"
$fname_64(handle(), n, dx, stride(dx, 1), result)
else
@@ -529,10 +529,10 @@ for (fname, fname_64, elty) in ((:cublasDgemv_v2, :cublasDgemv_v2_64, :Float64),
(:cublasCgemv_v2, :cublasCgemv_v2_64, :ComplexF32))
@eval begin
function gemv!(trans::Char,
- alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+ alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
A::StridedCuMatrix{$elty},
x::StridedCuVector{$elty},
- beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+ beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
y::StridedCuVector{$elty})
# handle trans
m,n = size(A)
@@ -561,7 +561,7 @@ end
function gemv(trans::Char, alpha::CuRefArray{T, CuVector{T, DeviceMemory}}, A::StridedCuMatrix{T}, x::StridedCuVector{T}) where {T}
return gemv!(trans, alpha, A, x, CuRef{T}(zero(T)), similar(x, size(A, (trans == 'N' ? 1 : 2))))
end
-function gemv(trans::Char, alpha::Number, A::StridedCuMatrix{T}, x::StridedCuVector{T}) where T
+function gemv(trans::Char, alpha::Number, A::StridedCuMatrix{T}, x::StridedCuVector{T}) where {T}
gemv!(trans, alpha, A, x, zero(T), similar(x, size(A, (trans == 'N' ? 1 : 2))))
end
# should this be async?
@@ -579,12 +579,12 @@ for (fname, fname_64, eltyin, eltyout, eltyconst) in (
)
@eval begin
function gemv_batched!(trans::Char,
- alpha::CuRefArray{$eltyconst, CuVector{$eltyconst, DeviceMemory}},
- A::Vector{<:StridedCuMatrix{$eltyin}},
- x::Vector{<:StridedCuVector{$eltyin}},
- beta::CuRefArray{$eltyconst, CuVector{$eltyconst, DeviceMemory}},
- y::Vector{<:StridedCuVector{$eltyout}}
- )
+ alpha::CuRefArray{$eltyconst, CuVector{$eltyconst, DeviceMemory}},
+ A::Vector{<:StridedCuMatrix{$eltyin}},
+ x::Vector{<:StridedCuVector{$eltyin}},
+ beta::CuRefArray{$eltyconst, CuVector{$eltyconst, DeviceMemory}},
+ y::Vector{<:StridedCuVector{$eltyout}}
+ )
if length(A) != length(x) || length(A) != length(y)
throw(DimensionMismatch("Lengths of inputs must be the same"))
end
@@ -615,13 +615,13 @@ for (fname, fname_64, eltyin, eltyout, eltyconst) in (
y
end
function gemv_batched!(
- trans::Char,
- alpha::Number,
- A::Vector{<:StridedCuMatrix{$eltyin}},
- x::Vector{<:StridedCuVector{$eltyin}},
- beta::Number,
- y::Vector{<:StridedCuVector{$eltyout}}
- )
+ trans::Char,
+ alpha::Number,
+ A::Vector{<:StridedCuMatrix{$eltyin}},
+ x::Vector{<:StridedCuVector{$eltyin}},
+ beta::Number,
+ y::Vector{<:StridedCuVector{$eltyout}}
+ )
gpu_α = CuRef{$eltyconst}(alpha)
gpu_β = CuRef{$eltyconst}(beta)
y = gemv_batched!(trans, gpu_α, A, x, gpu_β, y)
@@ -641,12 +641,12 @@ for (fname, fname_64, eltyin, eltyout, eltyconst) in (
)
@eval begin
function gemv_strided_batched!(trans::Char,
- alpha::CuRefArray{$eltyconst, CuVector{$eltyconst, DeviceMemory}},
- A::AbstractArray{$eltyin, 3},
- x::AbstractArray{$eltyin, 2},
- beta::CuRefArray{$eltyconst, CuVector{$eltyconst, DeviceMemory}},
- y::AbstractArray{$eltyout, 2}
- )
+ alpha::CuRefArray{$eltyconst, CuVector{$eltyconst, DeviceMemory}},
+ A::AbstractArray{$eltyin, 3},
+ x::AbstractArray{$eltyin, 2},
+ beta::CuRefArray{$eltyconst, CuVector{$eltyconst, DeviceMemory}},
+ y::AbstractArray{$eltyout, 2}
+ )
if size(A, 3) != size(x, 2) || size(A, 3) != size(y, 2)
throw(DimensionMismatch("Batch sizes must be equal for all inputs"))
end
@@ -671,13 +671,13 @@ for (fname, fname_64, eltyin, eltyout, eltyconst) in (
y
end
function gemv_strided_batched!(
- trans::Char,
- alpha::Number,
- A::AbstractArray{$eltyin, 3},
- x::AbstractArray{$eltyin, 2},
- beta::Number,
- y::AbstractArray{$eltyout, 2}
- )
+ trans::Char,
+ alpha::Number,
+ A::AbstractArray{$eltyin, 3},
+ x::AbstractArray{$eltyin, 2},
+ beta::Number,
+ y::AbstractArray{$eltyout, 2}
+ )
gpu_α = CuRef{$eltyconst}(alpha)
gpu_β = CuRef{$eltyconst}(beta)
y = gemv_strided_batched!(trans, gpu_α, A, x, gpu_β, y)
@@ -697,10 +697,10 @@ for (fname, fname_64, elty) in ((:cublasDgbmv_v2, :cublasDgbmv_v2_64, :Float64),
m::Integer,
kl::Integer,
ku::Integer,
- alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+ alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
A::StridedCuMatrix{$elty},
x::StridedCuVector{$elty},
- beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+ beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
y::StridedCuVector{$elty})
n = size(A,2)
# check dimensions
@@ -716,16 +716,17 @@ for (fname, fname_64, elty) in ((:cublasDgbmv_v2, :cublasDgbmv_v2_64, :Float64),
end
y
end
- function gbmv!(trans::Char,
- m::Integer,
- kl::Integer,
- ku::Integer,
- alpha::Number,
- A::StridedCuMatrix{$elty},
- x::StridedCuVector{$elty},
- beta::Number,
- y::StridedCuVector{$elty}
- )
+ function gbmv!(
+ trans::Char,
+ m::Integer,
+ kl::Integer,
+ ku::Integer,
+ alpha::Number,
+ A::StridedCuMatrix{$elty},
+ x::StridedCuVector{$elty},
+ beta::Number,
+ y::StridedCuVector{$elty}
+ )
gpu_α = CuRef{$elty}(alpha)
gpu_β = CuRef{$elty}(beta)
@@ -735,8 +736,10 @@ for (fname, fname_64, elty) in ((:cublasDgbmv_v2, :cublasDgbmv_v2_64, :Float64),
end
end
end
-function gbmv(trans::Char, m::Integer, kl::Integer, ku::Integer, alpha::CuVector{T},
- A::StridedCuMatrix{T}, x::StridedCuVector{T}) where {T}
+function gbmv(
+ trans::Char, m::Integer, kl::Integer, ku::Integer, alpha::CuVector{T},
+ A::StridedCuMatrix{T}, x::StridedCuVector{T}
+ ) where {T}
# TODO: fix gbmv bug in julia
n = size(A, 2)
leny = trans == 'N' ? m : n
@@ -759,10 +762,10 @@ for (fname, fname_64, elty) in ((:cublasDspmv_v2, :cublasDspmv_v2_64, :Float64),
(:cublasSspmv_v2, :cublasSspmv_v2_64, :Float32))
@eval begin
function spmv!(uplo::Char,
- alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+ alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
AP::StridedCuVector{$elty},
x::StridedCuVector{$elty},
- beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+ beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
y::StridedCuVector{$elty})
n = round(Int, (sqrt(8*length(AP))-1)/2)
if n != length(x) || n != length(y) throw(DimensionMismatch("")) end
@@ -777,21 +780,24 @@ for (fname, fname_64, elty) in ((:cublasDspmv_v2, :cublasDspmv_v2_64, :Float64),
end
end
end
-function spmv!(uplo::Char,
- alpha::Number,
- AP::StridedCuVector{T},
- x::StridedCuVector{T},
- beta::Number,
- y::StridedCuVector{T}
- ) where {T}
+function spmv!(
+ uplo::Char,
+ alpha::Number,
+ AP::StridedCuVector{T},
+ x::StridedCuVector{T},
+ beta::Number,
+ y::StridedCuVector{T}
+ ) where {T}
gpu_α = CuRef{T}(alpha)
gpu_β = CuRef{T}(beta)
y = spmv!(uplo, gpu_α, AP, x, gpu_β, y)
synchronize()
return y
end
-function spmv(uplo::Char, alpha::CuVector{T},
- AP::StridedCuVector{T}, x::StridedCuVector{T}) where {T}
+function spmv(
+ uplo::Char, alpha::CuVector{T},
+ AP::StridedCuVector{T}, x::StridedCuVector{T}
+ ) where {T}
return spmv!(uplo, alpha, AP, x, CuRef{T}(zero(T)), similar(x))
end
function spmv(uplo::Char, alpha::Number,
@@ -810,10 +816,10 @@ for (fname, fname_64, elty) in ((:cublasDsymv_v2, :cublasDsymv_v2_64, :Float64),
# Note that the complex symv are not BLAS but auiliary functions in LAPACK
@eval begin
function symv!(uplo::Char,
- alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+ alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
A::StridedCuMatrix{$elty},
x::StridedCuVector{$elty},
- beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+ beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
y::StridedCuVector{$elty})
m, n = size(A)
if m != n throw(DimensionMismatch("Matrix A is $m by $n but must be square")) end
@@ -864,10 +870,10 @@ for (fname, fname_64, elty) in ((:cublasZhemv_v2, :cublasZhemv_v2_64, :ComplexF6
(:cublasChemv_v2, :cublasChemv_v2_64, :ComplexF32))
@eval begin
function hemv!(uplo::Char,
- alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+ alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
A::StridedCuMatrix{$elty},
x::StridedCuVector{$elty},
- beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+ beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
y::StridedCuVector{$elty})
# TODO: fix dimension check bug in julia
m, n = size(A)
@@ -922,10 +928,10 @@ for (fname, fname_64, elty) in ((:cublasDsbmv_v2, :cublasDsbmv_v2_64, :Float64),
@eval begin
function sbmv!(uplo::Char,
k::Integer,
- alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+ alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
A::StridedCuMatrix{$elty},
x::StridedCuVector{$elty},
- beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+ beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
y::StridedCuVector{$elty})
m, n = size(A)
#if m != n throw(DimensionMismatch("Matrix A is $m by $n but must be square")) end
@@ -981,10 +987,10 @@ for (fname, fname_64, elty) in ((:cublasZhbmv_v2, :cublasZhbmv_v2_64, :ComplexF6
@eval begin
function hbmv!(uplo::Char,
k::Integer,
- alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+ alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
A::StridedCuMatrix{$elty},
x::StridedCuVector{$elty},
- beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+ beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
y::StridedCuVector{$elty})
m, n = size(A)
if !(1<=(1+k)<=n) throw(DimensionMismatch("Incorrect number of bands")) end
@@ -1168,7 +1174,7 @@ for (fname, fname_64, elty) in ((:cublasDger_v2, :cublasDger_v2_64, :Float64),
(:cublasCgerc_v2, :cublasCgerc_v2_64, :ComplexF32))
@eval begin
function ger!(
- alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+ alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
x::StridedCuVector{$elty},
y::StridedCuVector{$elty},
A::StridedCuMatrix{$elty})
@@ -1204,7 +1210,7 @@ for (fname, fname_64, elty) in ((:cublasDspr_v2, :cublasDspr_v2_64, :Float64),
(:cublasSspr_v2, :cublasSspr_v2_64, :Float32))
@eval begin
function spr!(uplo::Char,
- alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+ alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
x::StridedCuVector{$elty},
AP::StridedCuVector{$elty})
n = round(Int, (sqrt(8*length(AP))-1)/2)
@@ -1238,7 +1244,7 @@ for (fname, fname_64, elty) in ((:cublasDsyr_v2, :cublasDsyr_v2_64, :Float64),
(:cublasCsyr_v2, :cublasCsyr_v2_64, :ComplexF32))
@eval begin
function syr!(uplo::Char,
- alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+ alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
x::StridedCuVector{$elty},
A::StridedCuMatrix{$elty})
m, n = size(A)
@@ -1274,7 +1280,7 @@ for (fname, fname_64, elty, relty) in (
)
@eval begin
function her!(uplo::Char,
- alpha::CuRefArray{$relty, CuVector{$relty, DeviceMemory}},
+ alpha::CuRefArray{$relty, CuVector{$relty, DeviceMemory}},
x::StridedCuVector{$elty},
A::StridedCuMatrix{$elty})
m, n = size(A)
@@ -1308,11 +1314,11 @@ for (fname, fname_64, elty) in ((:cublasZher2_v2, :cublasZher2_v2_64, :ComplexF6
(:cublasCher2_v2, :cublasCher2_v2_64, :ComplexF32))
@eval begin
function her2!(uplo::Char,
- alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
- x::StridedCuVector{$elty},
- y::StridedCuVector{$elty},
- A::StridedCuMatrix{$elty}
- )
+ alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+ x::StridedCuVector{$elty},
+ y::StridedCuVector{$elty},
+ A::StridedCuMatrix{$elty}
+ )
m, n = size(A)
m == n || throw(DimensionMismatch("Matrix A is $m by $n but must be square"))
length(x) == n || throw(DimensionMismatch("Length of vector must be the same as the matrix dimensions"))
@@ -1352,10 +1358,10 @@ for (fname, fname_64, elty) in ((:cublasDgemm_v2, :cublasDgemm_v2_64, :Float64),
@eval begin
function gemm!(transA::Char,
transB::Char,
- alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+ alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
A::StridedCuVecOrMat{$elty},
B::StridedCuVecOrMat{$elty},
- beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+ beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
C::StridedCuVecOrMat{$elty})
m = size(A, transA == 'N' ? 1 : 2)
k = size(A, transA == 'N' ? 2 : 1)
@@ -1493,10 +1499,10 @@ function gemmExComputeType(TA, TB, TC, m, k, n)
end
function gemmEx!(transA::Char, transB::Char,
- @nospecialize(alpha::CuRefArray),
+ @nospecialize(alpha::CuRefArray),
@nospecialize(A::StridedCuVecOrMat),
@nospecialize(B::StridedCuVecOrMat),
- @nospecialize(beta::CuRefArray),
+ @nospecialize(beta::CuRefArray),
@nospecialize(C::StridedCuVecOrMat);
algo::cublasGemmAlgo_t=CUBLAS_GEMM_DEFAULT)
m = size(A, transA == 'N' ? 1 : 2)
@@ -1551,10 +1557,10 @@ end
# TODO for device mode pointers
function gemmBatchedEx!(transA::Char, transB::Char,
- @nospecialize(alpha::CuRefArray),
+ @nospecialize(alpha::CuRefArray),
@nospecialize(A::Vector{<:StridedCuVecOrMat}),
@nospecialize(B::Vector{<:StridedCuVecOrMat}),
- @nospecialize(beta::CuRefArray),
+ @nospecialize(beta::CuRefArray),
@nospecialize(C::Vector{<:StridedCuVecOrMat});
algo::cublasGemmAlgo_t=CUBLAS_GEMM_DEFAULT)
if length(A) != length(B) || length(A) != length(C)
@@ -1622,11 +1628,11 @@ function gemmBatchedEx!(
end
function gemmStridedBatchedEx!(
- transA::Char, transB::Char,
- @nospecialize(alpha::CuRefArray),
+ transA::Char, transB::Char,
+ @nospecialize(alpha::CuRefArray),
@nospecialize(A::AbstractArray{Ta, 3}),
@nospecialize(B::AbstractArray{Tb, 3}),
- @nospecialize(beta::CuRefArray),
+ @nospecialize(beta::CuRefArray),
@nospecialize(C::AbstractArray{Tc, 3});
algo::cublasGemmAlgo_t=CUBLAS_GEMM_DEFAULT) where {Ta, Tb, Tc}
if size(A, 3) != size(B, 3) || size(A, 3) != size(C, 3)
@@ -1865,10 +1871,10 @@ for (fname, fname_64, elty) in ((:cublasDgemmBatched, :cublasDgemmBatched_64, :F
@eval begin
function gemm_batched!(transA::Char,
transB::Char,
- alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+ alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
A::Vector{<:StridedCuMatrix{$elty}},
B::Vector{<:StridedCuMatrix{$elty}},
- beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+ beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
C::Vector{<:StridedCuMatrix{$elty}})
if length(A) != length(B) || length(A) != length(C)
throw(DimensionMismatch(""))
@@ -1948,10 +1954,10 @@ for (fname, fname_64, elty) in ((:cublasDgemmStridedBatched, :cublasDgemmStrided
@eval begin
function gemm_strided_batched!(transA::Char,
transB::Char,
- alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+ alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
A::AbstractArray{$elty, 3}, # allow PermutedDimsArray
B::AbstractArray{$elty, 3},
- beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+ beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
C::AbstractArray{$elty, 3})
m = size(A, transA == 'N' ? 1 : 2)
k = size(A, transA == 'N' ? 2 : 1)
@@ -2031,10 +2037,10 @@ for (fname, fname_64, elty) in ((:cublasDsymm_v2, :cublasDsymm_v2_64, :Float64),
@eval begin
function symm!(side::Char,
uplo::Char,
- alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+ alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
A::StridedCuMatrix{$elty},
B::StridedCuMatrix{$elty},
- beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+ beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
C::StridedCuMatrix{$elty})
k, nA = size(A)
if k != nA throw(DimensionMismatch("Matrix A must be square")) end
@@ -2093,9 +2099,9 @@ for (fname, fname_64, elty) in ((:cublasDsyrk_v2, :cublasDsyrk_v2_64, :Float64),
@eval begin
function syrk!(uplo::Char,
trans::Char,
- alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+ alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
A::StridedCuVecOrMat{$elty},
- beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+ beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
C::StridedCuMatrix{$elty})
mC, n = size(C)
if mC != n throw(DimensionMismatch("C must be square")) end
@@ -2146,10 +2152,10 @@ for (fname, fname_64, elty) in ((:cublasDsyrkx, :cublasDsyrkx_64, :Float64),
@eval begin
function syrkx!(uplo::Char,
trans::Char,
- alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+ alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
A::StridedCuVecOrMat{$elty},
B::StridedCuVecOrMat{$elty},
- beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+ beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
C::StridedCuMatrix{$elty})
mC, n = size(C)
if mC != n throw(DimensionMismatch("C must be square")) end
@@ -2205,10 +2211,10 @@ for (fname, fname_64, elty) in ((:cublasZhemm_v2, :cublasZhemm_v2_64, :ComplexF6
@eval begin
function hemm!(side::Char,
uplo::Char,
- alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+ alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
A::StridedCuMatrix{$elty},
B::StridedCuMatrix{$elty},
- beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+ beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
C::StridedCuMatrix{$elty})
mA, nA = size(A)
m, n = size(B)
@@ -2268,9 +2274,9 @@ for (fname, fname_64, elty, relty) in (
@eval begin
function herk!(uplo::Char,
trans::Char,
- alpha::CuRefArray{$relty, CuVector{$relty, DeviceMemory}},
+ alpha::CuRefArray{$relty, CuVector{$relty, DeviceMemory}},
A::StridedCuVecOrMat{$elty},
- beta::CuRefArray{$relty, CuVector{$relty, DeviceMemory}},
+ beta::CuRefArray{$relty, CuVector{$relty, DeviceMemory}},
C::StridedCuMatrix{$elty})
mC, n = size(C)
if mC != n throw(DimensionMismatch("C must be square")) end
@@ -2327,10 +2333,10 @@ for (fname, fname_64, elty) in ((:cublasDsyr2k_v2, :cublasDsyr2k_v2_64, :Float64
@eval begin
function syr2k!(uplo::Char,
trans::Char,
- alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+ alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
A::StridedCuVecOrMat{$elty},
B::StridedCuVecOrMat{$elty},
- beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+ beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
C::StridedCuMatrix{$elty})
# TODO: check size of B in julia (syr2k!)
m, n = size(C)
@@ -2386,7 +2392,7 @@ function syr2k(uplo::Char,
B::StridedCuVecOrMat)
T = eltype(A)
n = size(A, trans == 'N' ? 1 : 2)
- syr2k!(uplo, trans, convert(T, alpha), A, B, zero(T), similar(A, T, (n, n)))
+ return syr2k!(uplo, trans, convert(T, alpha), A, B, zero(T), similar(A, T, (n, n)))
end
function syr2k(uplo::Char, trans::Char, A::StridedCuVecOrMat, B::StridedCuVecOrMat)
syr2k(uplo, trans, one(eltype(A)), A, B)
@@ -2400,10 +2406,10 @@ for (fname, fname_64, elty, relty) in (
@eval begin
function her2k!(uplo::Char,
trans::Char,
- alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+ alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
A::StridedCuVecOrMat{$elty},
B::StridedCuVecOrMat{$elty},
- beta::CuRefArray{$relty, CuVector{$relty, DeviceMemory}},
+ beta::CuRefArray{$relty, CuVector{$relty, DeviceMemory}},
C::StridedCuMatrix{$elty})
# TODO: check size of B in julia (her2k!)
m, n = size(C)
@@ -2477,7 +2483,7 @@ for (mmname, smname, elty) in
uplo::Char,
transa::Char,
diag::Char,
- alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+ alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
A::StridedCuMatrix{$elty},
B::StridedCuMatrix{$elty},
C::StridedCuMatrix{$elty})
@@ -2499,7 +2505,7 @@ for (mmname, smname, elty) in
uplo::Char,
transa::Char,
diag::Char,
- alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+ alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
A::StridedCuMatrix{$elty},
B::StridedCuMatrix{$elty})
m, n = size(B)
@@ -2564,7 +2570,7 @@ for (fname, fname_64, elty) in ((:cublasDtrsmBatched, :cublasDtrsmBatched_64, :F
uplo::Char,
transa::Char,
diag::Char,
- alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+ alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
A::Vector{<:StridedCuMatrix{$elty}},
B::Vector{<:StridedCuMatrix{$elty}})
if length(A) != length(B)
@@ -2620,9 +2626,9 @@ for (fname, fname_64, elty) in ((:cublasDgeam, :cublasDgeam_64, :Float64),
@eval begin
function geam!(transa::Char,
transb::Char,
- alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+ alpha::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
A::StridedCuMatrix{$elty},
- beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
+ beta::CuRefArray{$elty, CuVector{$elty, DeviceMemory}},
B::StridedCuMatrix{$elty},
C::StridedCuMatrix{$elty})
mA, nA = size(A)
@@ -2860,8 +2866,9 @@ for (fname, elty) in ((:cublasDgetriBatched, :Float64),
end
function getri_batched!(n, Aptrs::CuVector{CuPtr{$elty}},
- lda, Cptrs::CuVector{CuPtr{$elty}},ldc,
- pivotArray::CuArray{Cint})
+ lda, Cptrs::CuVector{CuPtr{$elty}}, ldc,
+ pivotArray::CuArray{Cint}
+ )
batchSize = length(Aptrs)
info = CuArray{Cint}(undef, batchSize)
             $fname(handle(), n, Aptrs, lda, pivotArray, Cptrs, ldc, info, batchSize)
CI failures seem relevant. Feel free to ignore the formatter; I made it less spammy 😉 |
Attempting to address #2571

I've set the pointer mode to "device side" during handle creation. Since gemmGroupedBatched doesn't support device side pointer mode, it won't be usable. One workaround for this would be to add a new function that creates a handle with host side mode, or to add the pointer mode as an optional kwarg to handle(). Very open to feedback on this.

I've set this up so that users can supply CuRefs of the appropriate result type to the level 1 functions for results. If that's not provided, the functions execute as they do today (synchronously). Similarly, for functions taking alpha or beta scalar arguments, if the user provides a CuRef (actually a CuRefArray), the functions will execute asynchronously and return instantly. If the user provides a Number, the behaviour is unchanged from today. I'm not married to this design and it can certainly be changed.

cc @Jutho