
Split out level 3 gemm tests #2610

Open · wants to merge 1 commit into base: master

Conversation


@kshyatt kshyatt commented Jan 8, 2025

Testing locally, the split-out level 3 GEMM tests take about the same total time as the original level 3 suite, so the split should help with test parallelization. Also removed an extraneous comment.
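For context, the usual mechanics of such a split in a Julia test suite are to move the GEMM-flavored testsets into their own file so the runner can schedule the two files as independent jobs. A minimal sketch of the idea, with illustrative file names and testset contents (not the actual layout of this PR):

```julia
# Hypothetical split (file names and groupings are illustrative):
#   test/libraries/cublas/level3.jl       -> symm!, syrk!, trmm!, ...
#   test/libraries/cublas/level3_gemm.jl  -> gemm!, gemmEx!, xt_gemm!, ...
# Each file stays a self-contained collection of testsets, so the test
# runner can pick the two files up as separate parallel work items.
using Test

@testset "cublas/level3_gemm" begin
    @testset "gemm!" begin
        # ... GEMM correctness checks move here unchanged ...
        @test 1 + 1 == 2  # placeholder
    end
end
```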

@kshyatt kshyatt requested a review from maleadt January 8, 2025 16:45
@github-actions github-actions bot left a comment

CUDA.jl Benchmarks

| Benchmark suite | Current: c8ee0b3 | Previous: 3d45d85 | Ratio |
| --- | --- | --- | --- |
| latency/precompile | 45269838908.5 ns | 45362897043 ns | 1.00 |
| latency/ttfp | 6444243847.5 ns | 6376155312.5 ns | 1.01 |
| latency/import | 3056942594 ns | 3036001837 ns | 1.01 |
| integration/volumerhs | 9573166.5 ns | 9568516 ns | 1.00 |
| integration/byval/slices=1 | 147066 ns | 146875.5 ns | 1.00 |
| integration/byval/slices=3 | 425459 ns | 425040 ns | 1.00 |
| integration/byval/reference | 144943 ns | 144927 ns | 1.00 |
| integration/byval/slices=2 | 286058 ns | 286033 ns | 1.00 |
| integration/cudadevrt | 103404 ns | 103435 ns | 1.00 |
| kernel/indexing | 14062 ns | 14009 ns | 1.00 |
| kernel/indexing_checked | 14877 ns | 14794 ns | 1.01 |
| kernel/occupancy | 709.9448275862069 ns | 698.5298013245033 ns | 1.02 |
| kernel/launch | 2114.5555555555557 ns | 2154 ns | 0.98 |
| kernel/rand | 15585 ns | 18303 ns | 0.85 |
| array/reverse/1d | 19625 ns | 19605 ns | 1.00 |
| array/reverse/2d | 24702 ns | 24620 ns | 1.00 |
| array/reverse/1d_inplace | 10733 ns | 10792.666666666666 ns | 0.99 |
| array/reverse/2d_inplace | 11080 ns | 11263 ns | 0.98 |
| array/copy | 20636 ns | 20439 ns | 1.01 |
| array/iteration/findall/int | 156655 ns | 155820 ns | 1.01 |
| array/iteration/findall/bool | 136293 ns | 134569 ns | 1.01 |
| array/iteration/findfirst/int | 154171 ns | 154288 ns | 1.00 |
| array/iteration/findfirst/bool | 153224 ns | 153959 ns | 1.00 |
| array/iteration/scalar | 62357 ns | 61548 ns | 1.01 |
| array/iteration/logical | 197438 ns | 203707 ns | 0.97 |
| array/iteration/findmin/1d | 38653 ns | 38870 ns | 0.99 |
| array/iteration/findmin/2d | 93874.5 ns | 94333 ns | 1.00 |
| array/reductions/reduce/1d | 38586 ns | 30423 ns | 1.27 |
| array/reductions/reduce/2d | 46894 ns | 51457 ns | 0.91 |
| array/reductions/mapreduce/1d | 35394 ns | 30142 ns | 1.17 |
| array/reductions/mapreduce/2d | 43810.5 ns | 51380 ns | 0.85 |
| array/broadcast | 21361 ns | 21382 ns | 1.00 |
| array/copyto!/gpu_to_gpu | 11557 ns | 11620 ns | 0.99 |
| array/copyto!/cpu_to_gpu | 209583 ns | 209662 ns | 1.00 |
| array/copyto!/gpu_to_cpu | 242146 ns | 242902.5 ns | 1.00 |
| array/accumulate/1d | 108411 ns | 109331 ns | 0.99 |
| array/accumulate/2d | 80177 ns | 80156 ns | 1.00 |
| array/construct | 1264.2 ns | 1280.3 ns | 0.99 |
| array/random/randn/Float32 | 43486 ns | 49367 ns | 0.88 |
| array/random/randn!/Float32 | 26362 ns | 26244 ns | 1.00 |
| array/random/rand!/Int64 | 27227 ns | 27126 ns | 1.00 |
| array/random/rand!/Float32 | 8704.333333333334 ns | 8464.333333333334 ns | 1.03 |
| array/random/rand/Int64 | 29999 ns | 35460 ns | 0.85 |
| array/random/rand/Float32 | 12893 ns | 12776 ns | 1.01 |
| array/permutedims/4d | 67505 ns | 67483 ns | 1.00 |
| array/permutedims/2d | 57096 ns | 57092.5 ns | 1.00 |
| array/permutedims/3d | 59559 ns | 59419.5 ns | 1.00 |
| array/sorting/1d | 2776620 ns | 2776311.5 ns | 1.00 |
| array/sorting/by | 3369035 ns | 3367794.5 ns | 1.00 |
| array/sorting/2d | 1084970.5 ns | 1086101 ns | 1.00 |
| cuda/synchronization/stream/auto | 1029.6 ns | 1013.0833333333334 ns | 1.02 |
| cuda/synchronization/stream/nonblocking | 6471.4 ns | 6507 ns | 0.99 |
| cuda/synchronization/stream/blocking | 800.2604166666666 ns | 807.4622641509434 ns | 0.99 |
| cuda/synchronization/context/auto | 1192.5 ns | 1212.8 ns | 0.98 |
| cuda/synchronization/context/nonblocking | 6641.2 ns | 6677.8 ns | 0.99 |
| cuda/synchronization/context/blocking | 912.1590909090909 ns | 948.4545454545455 ns | 0.96 |

This comment was automatically generated by workflow using github-action-benchmark.
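The Ratio column above is current time divided by previous time, so values above 1.00 are slowdowns and values below 1.00 are speedups. A generic sketch of how such a table can be scanned for notable changes (the ±5% noise threshold is an illustrative choice, not part of the benchmark workflow; the sample values are taken from the table above):

```julia
# (name, current ns, previous ns) triples copied from the table above.
results = [
    ("array/reductions/reduce/1d", 38586.0, 30423.0),  # reported ratio 1.27
    ("array/random/randn/Float32", 43486.0, 49367.0),  # reported ratio 0.88
    ("integration/volumerhs", 9573166.5, 9568516.0),   # reported ratio 1.00
]
for (name, current, previous) in results
    ratio = round(current / previous, digits = 2)
    verdict = abs(ratio - 1) <= 0.05 ? "noise" :
              ratio > 1 ? "regression" : "improvement"
    println(name, ": ", ratio, " (", verdict, ")")
end
```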


maleadt commented Jan 17, 2025

Failure seems related:

```
libraries/cublas/level3: Error During Test at /var/lib/buildkite-agent/builds/gpuci-8/julialang/cuda-dot-jl/test/libraries/cublas/level3.jl:20
2025-01-08 18:25:58 CEST	  Got exception outside of a @test
2025-01-08 18:25:58 CEST	  CUBLASError: an invalid value was used as an argument (code 7, CUBLAS_STATUS_INVALID_VALUE)
```
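For context, CUBLAS_STATUS_INVALID_VALUE (code 7) is cuBLAS's generic "an illegal parameter was passed" error, typically inconsistent m/n/k or leading-dimension arguments reaching the C library. A hedged sketch of the call shape involved (requires a CUDA GPU to run; the sizes here are illustrative, not taken from the failing test):

```julia
using CUDA, CUDA.CUBLAS

m, k, n = 4, 3, 5
d_A = CUDA.rand(Float32, m, k)
d_B = CUDA.rand(Float32, k, n)
d_C = CUDA.rand(Float32, m, n)

# C = alpha*A*B + beta*C; consistent dimensions succeed.
CUBLAS.gemm!('N', 'N', 1.0f0, d_A, d_B, 0.0f0, d_C)

# Mismatched shapes are usually caught on the Julia side as a
# DimensionMismatch before reaching cuBLAS; arguments that only cuBLAS
# itself validates come back as a CUBLASError with
# CUBLAS_STATUS_INVALID_VALUE, as in the log above.
d_Bbad = CUDA.rand(Float32, k + 1, n + 1)
# CUBLAS.gemm!('N', 'N', 1.0f0, d_A, d_Bbad, 0.0f0, d_C)  # would throw
```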


kshyatt commented Jan 19, 2025

Can't repro this after rebasing onto latest master. Let me push and see if it persists.

Comment on lines +41 to +48
```julia
A = rand(elty,m,k)
B = rand(elty,k,n)
C1 = rand(elty,m,n)
C2 = copy(C1)
d_A = CuArray(A)
d_B = CuArray(B)
d_C1 = CuArray(C1)
d_C2 = CuArray(C2)
```

Suggested change

```julia
A = rand(elty, m, k)
B = rand(elty, k, n)
C1 = rand(elty, m, n)
hA = rand(elty, m, m)
sA = rand(elty, m, m)
CUBLAS.gemm!('N', 'N', alpha, d_A, d_B, beta, d_C1)
C1 = (alpha * A) * B + beta * C1
C2 = A * B
```

Comment on lines +69 to +71
```julia
denseA = CUDA.rand(elty, 4,4)
denseB = CUDA.rand(elty, 4,4)
denseC = CUDA.zeros(elty, 4,4)
```

Suggested change

```julia
denseA = CUDA.rand(elty, 4, 4)
denseB = CUDA.rand(elty, 4, 4)
denseC = CUDA.zeros(elty, 4, 4)
```

Comment on lines +86 to +94
```julia
A = rand(elty,m,k)
B = rand(elty,k,n)
C1 = rand(elty,m,n)
d_A = CuArray(A)
d_B = CuArray(B)
d_C1 = CuArray(C1)
α = rand(elty)
β = rand(elty)
CUBLAS.gemmEx!('N','N',α,d_A,d_B,β,d_C1)
```

Suggested change

```julia
A = rand(elty, m, k)
B = rand(elty, k, n)
C1 = rand(elty, m, n)
CUBLAS.gemmEx!('N', 'N', α, d_A, d_B, β, d_C1)
C1 = (α * A) * B + β * C1
A = rand(elty, m, k)
B = rand(elty, k, n)
d_C1 = CUBLAS.gemm('N', 'N', d_A, d_B)
C1 = A * B
```

Comment on lines +118 to +140
```julia
A = rand(elty,m,k)
B = rand(elty,k,n)
C1 = rand(elty,m,n)
C2 = copy(C1)
d_A = CuArray(A)
d_B = CuArray(B)
Bbad = rand(elty,k+1,n+1)
d_Bbad = CuArray(Bbad)
d_C1 = CuArray(C1)
d_C2 = CuArray(C2)
@test_throws DimensionMismatch CUBLAS.xt_gemm!('N','N',alpha,d_A,d_Bbad,beta,d_C1)
CUBLAS.xt_gemm!('N','N',alpha,d_A,d_B,beta,d_C1)
mul!(d_C2, d_A, d_B)
h_C1 = Array(d_C1)
h_C2 = Array(d_C2)
C1 = (alpha*A)*B + beta*C1
C2 = A*B
# compare
@test C1 ≈ h_C1
@test C2 ≈ h_C2
end
@testset "xt_gemm! cpu" begin
alpha = rand(elty)
```

Suggested change

```julia
A = rand(elty, m, k)
B = rand(elty, k, n)
C1 = rand(elty, m, n)
C2 = copy(C1)
Bbad = rand(elty, k + 1, n + 1)
@test_throws DimensionMismatch CUBLAS.xt_gemm!('N', 'N', alpha, d_A, d_Bbad, beta, d_C1)
CUBLAS.xt_gemm!('N', 'N', alpha, d_A, d_B, beta, d_C1)
C1 = (alpha * A) * B + beta * C1
C2 = A * B
beta = rand(elty)
A = rand(elty, m, k)
B = rand(elty, k, n)
C1 = rand(elty, m, n)
C2 = copy(C1)
C3 = copy(C1)
C4 = copy(C2)
CUBLAS.xt_gemm!('N', 'N', alpha, A, B, beta, C1)
C3 = (alpha * A) * B + beta * C3
C4 = A * B
A = rand(elty, m, k)
B = rand(elty, k, n)
d_C = CUBLAS.xt_gemm('N', 'N', d_A, d_B)
C = A * B
```

Comment on lines +172 to +196
```julia
A = rand(elty,m,k)
B = rand(elty,k,n)
C = CUBLAS.xt_gemm('N','N',A,B)
C2 = A*B
# compare
@test C isa Array
@test C ≈ A*B
@test C ≈ C2
end
@testset "symm!" begin
alpha = rand(elty)
beta = rand(elty)
sA = rand(elty,m,m)
sA = sA + transpose(sA)
dsA = CuArray(sA)
B = rand(elty,m,n)
C = rand(elty,m,n)
Bbad = rand(elty,m+1,n+1)
d_B = CuArray(B)
d_C = CuArray(C)
d_Bbad = CuArray(Bbad)
CUBLAS.symm!('L','U',alpha,dsA,d_B,beta,d_C)
C = (alpha*sA)*B + beta*C
# compare
h_C = Array(d_C)
```

Suggested change

```julia
A = rand(elty, m, k)
B = rand(elty, k, n)
C = CUBLAS.xt_gemm('N', 'N', A, B)
C2 = A * B
@test C ≈ A * B
sA = rand(elty, m, m)
B = rand(elty, m, n)
C = rand(elty, m, n)
Bbad = rand(elty, m + 1, n + 1)
CUBLAS.symm!('L', 'U', alpha, dsA, d_B, beta, d_C)
C = (alpha * sA) * B + beta * C
@test_throws DimensionMismatch CUBLAS.symm!('L', 'U', alpha, dsA, d_Bbad, beta, d_C)
sA = rand(elty, m, m)
B = rand(elty, m, n)
C = rand(elty, m, n)
Bbad = rand(elty, m + 1, n + 1)
d_C = CUBLAS.symm('L', 'U', dsA, d_B)
C = sA * B
@test_throws DimensionMismatch CUBLAS.symm('L', 'U', dsA, d_Bbad)
sA = rand(elty, m, m)
B = rand(elty, m, n)
C = rand(elty, m, n)
Bbad = rand(elty, m + 1, n + 1)
CUBLAS.xt_symm!('L', 'U', alpha, dsA, d_B, beta, d_C)
C = (alpha * sA) * B + beta * C
```

Comment on lines +678 to +684
```julia
bA = [rand(elty,3*i,2*i) for i in 1:10]
bB = [rand(elty,2*i,5*i) for i in 1:10]
bC = [rand(elty,3*i,5*i) for i in 1:10]
# move to device
bd_A = CuArray{elty, 2}[]
bd_B = CuArray{elty, 2}[]
bd_C = CuArray{elty, 2}[]
```

Suggested change

```julia
bA = [rand(elty, 3 * i, 2 * i) for i in 1:10]
bB = [rand(elty, 2 * i, 5 * i) for i in 1:10]
bC = [rand(elty, 3 * i, 5 * i) for i in 1:10]
push!(bd_A, CuArray(bA[i]))
push!(bd_B, CuArray(bB[i]))
push!(bd_C, CuArray(bC[i]))
CUBLAS.gemm_grouped_batched!(transA, transB, alpha, bd_A, bd_B, beta, bd_C)
```

```julia
end

@testset "gemm_grouped_batched" begin
bd_C = CUBLAS.gemm_grouped_batched(transA,transB,bd_A,bd_B)
```

Suggested change

```julia
bd_C = CUBLAS.gemm_grouped_batched(transA, transB, bd_A, bd_B)
```

Comment on lines +713 to +720
```julia
m,k,n = 4,4,4
cudaTypes = (Float16, Complex{Float16}, BFloat16, Complex{BFloat16}, Float32, Complex{Float32},
             Float64, Complex{Float64}, Int8, Complex{Int8}, UInt8, Complex{UInt8},
             Int16, Complex{Int16}, UInt16, Complex{UInt16}, Int32, Complex{Int32},
             UInt32, Complex{UInt32}, Int64, Complex{Int64}, UInt64, Complex{UInt64})

for AT in cudaTypes, CT in cudaTypes
    BT = AT # gemmEx requires identical A and B types
```

Suggested change

```julia
m, k, n = 4, 4, 4
cudaTypes = (
    Float16, Complex{Float16}, BFloat16, Complex{BFloat16}, Float32, Complex{Float32},
    Float64, Complex{Float64}, Int8, Complex{Int8}, UInt8, Complex{UInt8},
    Int16, Complex{Int16}, UInt16, Complex{UInt16}, Int32, Complex{Int32},
    UInt32, Complex{UInt32}, Int64, Complex{Int64}, UInt64, Complex{UInt64},
)
if CUBLAS.gemmExComputeType(AT, BT, CT, m, k, n) !== nothing
A = AT <: BFloat16 ? AT.(rand(m, k)) : rand(AT, m, k)
B = BT <: BFloat16 ? BT.(rand(k, n)) : rand(BT, k, n)
```

Comment on lines +740 to +745
```julia
@test C ≈ Array(dC) rtol=rtol
end
end

# also test an unsupported combination (falling back to GPUArrays)
if VERSION < v"1.11-" # JuliaGPU/CUDA.jl#2441
```

Suggested change

```julia
@test C ≈ Array(dC) rtol = rtol
AT = BFloat16
BT = Int32
CT = Float64
A = AT.(rand(m, k))
B = rand(BT, k, n)
```

Comment on lines +761 to +764
```julia
@test C ≈ Array(dC) rtol=rtol
end
end
```


Suggested change

```julia
@test C ≈ Array(dC) rtol = rtol
testf(randn(784 * 100), rand(Float32, 784, 100)) do p, x
p[reshape(1:(out * inn), out, inn)] * x
@view(p[reshape(1:(out * inn), out, inn)]) * x
```

2 participants