Split out level 3 gemm tests #2610
base: master
Conversation
CUDA.jl Benchmarks
| Benchmark suite | Current: c8ee0b3 | Previous: 3d45d85 | Ratio |
|---|---|---|---|
| latency/precompile | 45269838908.5 ns | 45362897043 ns | 1.00 |
| latency/ttfp | 6444243847.5 ns | 6376155312.5 ns | 1.01 |
| latency/import | 3056942594 ns | 3036001837 ns | 1.01 |
| integration/volumerhs | 9573166.5 ns | 9568516 ns | 1.00 |
| integration/byval/slices=1 | 147066 ns | 146875.5 ns | 1.00 |
| integration/byval/slices=3 | 425459 ns | 425040 ns | 1.00 |
| integration/byval/reference | 144943 ns | 144927 ns | 1.00 |
| integration/byval/slices=2 | 286058 ns | 286033 ns | 1.00 |
| integration/cudadevrt | 103404 ns | 103435 ns | 1.00 |
| kernel/indexing | 14062 ns | 14009 ns | 1.00 |
| kernel/indexing_checked | 14877 ns | 14794 ns | 1.01 |
| kernel/occupancy | 709.9448275862069 ns | 698.5298013245033 ns | 1.02 |
| kernel/launch | 2114.5555555555557 ns | 2154 ns | 0.98 |
| kernel/rand | 15585 ns | 18303 ns | 0.85 |
| array/reverse/1d | 19625 ns | 19605 ns | 1.00 |
| array/reverse/2d | 24702 ns | 24620 ns | 1.00 |
| array/reverse/1d_inplace | 10733 ns | 10792.666666666666 ns | 0.99 |
| array/reverse/2d_inplace | 11080 ns | 11263 ns | 0.98 |
| array/copy | 20636 ns | 20439 ns | 1.01 |
| array/iteration/findall/int | 156655 ns | 155820 ns | 1.01 |
| array/iteration/findall/bool | 136293 ns | 134569 ns | 1.01 |
| array/iteration/findfirst/int | 154171 ns | 154288 ns | 1.00 |
| array/iteration/findfirst/bool | 153224 ns | 153959 ns | 1.00 |
| array/iteration/scalar | 62357 ns | 61548 ns | 1.01 |
| array/iteration/logical | 197438 ns | 203707 ns | 0.97 |
| array/iteration/findmin/1d | 38653 ns | 38870 ns | 0.99 |
| array/iteration/findmin/2d | 93874.5 ns | 94333 ns | 1.00 |
| array/reductions/reduce/1d | 38586 ns | 30423 ns | 1.27 |
| array/reductions/reduce/2d | 46894 ns | 51457 ns | 0.91 |
| array/reductions/mapreduce/1d | 35394 ns | 30142 ns | 1.17 |
| array/reductions/mapreduce/2d | 43810.5 ns | 51380 ns | 0.85 |
| array/broadcast | 21361 ns | 21382 ns | 1.00 |
| array/copyto!/gpu_to_gpu | 11557 ns | 11620 ns | 0.99 |
| array/copyto!/cpu_to_gpu | 209583 ns | 209662 ns | 1.00 |
| array/copyto!/gpu_to_cpu | 242146 ns | 242902.5 ns | 1.00 |
| array/accumulate/1d | 108411 ns | 109331 ns | 0.99 |
| array/accumulate/2d | 80177 ns | 80156 ns | 1.00 |
| array/construct | 1264.2 ns | 1280.3 ns | 0.99 |
| array/random/randn/Float32 | 43486 ns | 49367 ns | 0.88 |
| array/random/randn!/Float32 | 26362 ns | 26244 ns | 1.00 |
| array/random/rand!/Int64 | 27227 ns | 27126 ns | 1.00 |
| array/random/rand!/Float32 | 8704.333333333334 ns | 8464.333333333334 ns | 1.03 |
| array/random/rand/Int64 | 29999 ns | 35460 ns | 0.85 |
| array/random/rand/Float32 | 12893 ns | 12776 ns | 1.01 |
| array/permutedims/4d | 67505 ns | 67483 ns | 1.00 |
| array/permutedims/2d | 57096 ns | 57092.5 ns | 1.00 |
| array/permutedims/3d | 59559 ns | 59419.5 ns | 1.00 |
| array/sorting/1d | 2776620 ns | 2776311.5 ns | 1.00 |
| array/sorting/by | 3369035 ns | 3367794.5 ns | 1.00 |
| array/sorting/2d | 1084970.5 ns | 1086101 ns | 1.00 |
| cuda/synchronization/stream/auto | 1029.6 ns | 1013.0833333333334 ns | 1.02 |
| cuda/synchronization/stream/nonblocking | 6471.4 ns | 6507 ns | 0.99 |
| cuda/synchronization/stream/blocking | 800.2604166666666 ns | 807.4622641509434 ns | 0.99 |
| cuda/synchronization/context/auto | 1192.5 ns | 1212.8 ns | 0.98 |
| cuda/synchronization/context/nonblocking | 6641.2 ns | 6677.8 ns | 0.99 |
| cuda/synchronization/context/blocking | 912.1590909090909 ns | 948.4545454545455 ns | 0.96 |
This comment was automatically generated by a workflow using github-action-benchmark.
Failure seems related:

Can't repro this after rebasing onto latest master. Let me push and see if it persists.
```julia
A = rand(elty,m,k)
B = rand(elty,k,n)
C1 = rand(elty,m,n)
C2 = copy(C1)
d_A = CuArray(A)
d_B = CuArray(B)
d_C1 = CuArray(C1)
d_C2 = CuArray(C2)
```

Suggested change:

```diff
-A = rand(elty,m,k)
-B = rand(elty,k,n)
-C1 = rand(elty,m,n)
-C2 = copy(C1)
-d_A = CuArray(A)
-d_B = CuArray(B)
-d_C1 = CuArray(C1)
-d_C2 = CuArray(C2)
+A = rand(elty, m, k)
+B = rand(elty, k, n)
+C1 = rand(elty, m, n)
+hA = rand(elty, m, m)
+sA = rand(elty, m, m)
+CUBLAS.gemm!('N', 'N', alpha, d_A, d_B, beta, d_C1)
+C1 = (alpha * A) * B + beta * C1
+C2 = A * B
```
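The reference computation checked by this test can be reproduced on the CPU without a GPU. A minimal sketch (illustrative sizes and scalars; the test suite defines its own `m`, `k`, `n`), using the five-argument `mul!` from LinearAlgebra, which performs the same `C ← α*A*B + β*C` update that `gemm!` does:

```julia
using LinearAlgebra

# Sketch of the gemm! reference semantics: C ← alpha*A*B + beta*C.
# Sizes and scalars are illustrative, not taken from the test suite.
m, k, n = 4, 5, 6
alpha, beta = 2.0, 0.5
A = rand(m, k); B = rand(k, n); C = rand(m, n)
expected = (alpha * A) * B + beta * C
mul!(C, A, B, alpha, beta)   # five-argument mul! does the same update in place
@assert C ≈ expected
```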
```julia
denseA = CUDA.rand(elty, 4,4)
denseB = CUDA.rand(elty, 4,4)
denseC = CUDA.zeros(elty, 4,4)
```

Suggested change:

```diff
-denseA = CUDA.rand(elty, 4,4)
-denseB = CUDA.rand(elty, 4,4)
-denseC = CUDA.zeros(elty, 4,4)
+denseA = CUDA.rand(elty, 4, 4)
+denseB = CUDA.rand(elty, 4, 4)
+denseC = CUDA.zeros(elty, 4, 4)
```
```julia
A = rand(elty,m,k)
B = rand(elty,k,n)
C1 = rand(elty,m,n)
d_A = CuArray(A)
d_B = CuArray(B)
d_C1 = CuArray(C1)
α = rand(elty)
β = rand(elty)
CUBLAS.gemmEx!('N','N',α,d_A,d_B,β,d_C1)
```

Suggested change:

```diff
-A = rand(elty,m,k)
-B = rand(elty,k,n)
-C1 = rand(elty,m,n)
-d_A = CuArray(A)
-d_B = CuArray(B)
-d_C1 = CuArray(C1)
-α = rand(elty)
-β = rand(elty)
-CUBLAS.gemmEx!('N','N',α,d_A,d_B,β,d_C1)
+A = rand(elty, m, k)
+B = rand(elty, k, n)
+C1 = rand(elty, m, n)
+CUBLAS.gemmEx!('N', 'N', α, d_A, d_B, β, d_C1)
+C1 = (α * A) * B + β * C1
+A = rand(elty, m, k)
+B = rand(elty, k, n)
+d_C1 = CUBLAS.gemm('N', 'N', d_A, d_B)
+C1 = A * B
```
```julia
A = rand(elty,m,k)
B = rand(elty,k,n)
C1 = rand(elty,m,n)
C2 = copy(C1)
d_A = CuArray(A)
d_B = CuArray(B)
Bbad = rand(elty,k+1,n+1)
d_Bbad = CuArray(Bbad)
d_C1 = CuArray(C1)
d_C2 = CuArray(C2)
@test_throws DimensionMismatch CUBLAS.xt_gemm!('N','N',alpha,d_A,d_Bbad,beta,d_C1)
CUBLAS.xt_gemm!('N','N',alpha,d_A,d_B,beta,d_C1)
mul!(d_C2, d_A, d_B)
h_C1 = Array(d_C1)
h_C2 = Array(d_C2)
C1 = (alpha*A)*B + beta*C1
C2 = A*B
# compare
@test C1 ≈ h_C1
@test C2 ≈ h_C2
end
@testset "xt_gemm! cpu" begin
alpha = rand(elty)
```

Suggested change:

```diff
-A = rand(elty,m,k)
-B = rand(elty,k,n)
-C1 = rand(elty,m,n)
-C2 = copy(C1)
-d_A = CuArray(A)
-d_B = CuArray(B)
-Bbad = rand(elty,k+1,n+1)
-d_Bbad = CuArray(Bbad)
-d_C1 = CuArray(C1)
-d_C2 = CuArray(C2)
-@test_throws DimensionMismatch CUBLAS.xt_gemm!('N','N',alpha,d_A,d_Bbad,beta,d_C1)
-CUBLAS.xt_gemm!('N','N',alpha,d_A,d_B,beta,d_C1)
-mul!(d_C2, d_A, d_B)
-h_C1 = Array(d_C1)
-h_C2 = Array(d_C2)
-C1 = (alpha*A)*B + beta*C1
-C2 = A*B
-# compare
-@test C1 ≈ h_C1
-@test C2 ≈ h_C2
-end
-@testset "xt_gemm! cpu" begin
-alpha = rand(elty)
+A = rand(elty, m, k)
+B = rand(elty, k, n)
+C1 = rand(elty, m, n)
+C2 = copy(C1)
+Bbad = rand(elty, k + 1, n + 1)
+@test_throws DimensionMismatch CUBLAS.xt_gemm!('N', 'N', alpha, d_A, d_Bbad, beta, d_C1)
+CUBLAS.xt_gemm!('N', 'N', alpha, d_A, d_B, beta, d_C1)
+C1 = (alpha * A) * B + beta * C1
+C2 = A * B
+beta = rand(elty)
+A = rand(elty, m, k)
+B = rand(elty, k, n)
+C1 = rand(elty, m, n)
+C2 = copy(C1)
+C3 = copy(C1)
+C4 = copy(C2)
+CUBLAS.xt_gemm!('N', 'N', alpha, A, B, beta, C1)
+C3 = (alpha * A) * B + beta * C3
+C4 = A * B
+A = rand(elty, m, k)
+B = rand(elty, k, n)
+d_C = CUBLAS.xt_gemm('N', 'N', d_A, d_B)
+C = A * B
```
```julia
A = rand(elty,m,k)
B = rand(elty,k,n)
C = CUBLAS.xt_gemm('N','N',A,B)
C2 = A*B
# compare
@test C isa Array
@test C ≈ A*B
@test C ≈ C2
end
@testset "symm!" begin
alpha = rand(elty)
beta = rand(elty)
sA = rand(elty,m,m)
sA = sA + transpose(sA)
dsA = CuArray(sA)
B = rand(elty,m,n)
C = rand(elty,m,n)
Bbad = rand(elty,m+1,n+1)
d_B = CuArray(B)
d_C = CuArray(C)
d_Bbad = CuArray(Bbad)
CUBLAS.symm!('L','U',alpha,dsA,d_B,beta,d_C)
C = (alpha*sA)*B + beta*C
# compare
h_C = Array(d_C)
```

Suggested change:

```diff
-A = rand(elty,m,k)
-B = rand(elty,k,n)
-C = CUBLAS.xt_gemm('N','N',A,B)
-C2 = A*B
-# compare
-@test C isa Array
-@test C ≈ A*B
-@test C ≈ C2
-end
-@testset "symm!" begin
-alpha = rand(elty)
-beta = rand(elty)
-sA = rand(elty,m,m)
-sA = sA + transpose(sA)
-dsA = CuArray(sA)
-B = rand(elty,m,n)
-C = rand(elty,m,n)
-Bbad = rand(elty,m+1,n+1)
-d_B = CuArray(B)
-d_C = CuArray(C)
-d_Bbad = CuArray(Bbad)
-CUBLAS.symm!('L','U',alpha,dsA,d_B,beta,d_C)
-C = (alpha*sA)*B + beta*C
-# compare
-h_C = Array(d_C)
+A = rand(elty, m, k)
+B = rand(elty, k, n)
+C = CUBLAS.xt_gemm('N', 'N', A, B)
+C2 = A * B
+@test C ≈ A * B
+sA = rand(elty, m, m)
+B = rand(elty, m, n)
+C = rand(elty, m, n)
+Bbad = rand(elty, m + 1, n + 1)
+CUBLAS.symm!('L', 'U', alpha, dsA, d_B, beta, d_C)
+C = (alpha * sA) * B + beta * C
+@test_throws DimensionMismatch CUBLAS.symm!('L', 'U', alpha, dsA, d_Bbad, beta, d_C)
+sA = rand(elty, m, m)
+B = rand(elty, m, n)
+C = rand(elty, m, n)
+Bbad = rand(elty, m + 1, n + 1)
+d_C = CUBLAS.symm('L', 'U', dsA, d_B)
+C = sA * B
+@test_throws DimensionMismatch CUBLAS.symm('L', 'U', dsA, d_Bbad)
+sA = rand(elty, m, m)
+B = rand(elty, m, n)
+C = rand(elty, m, n)
+Bbad = rand(elty, m + 1, n + 1)
+CUBLAS.xt_symm!('L', 'U', alpha, dsA, d_B, beta, d_C)
+C = (alpha * sA) * B + beta * C
```
```julia
bA = [rand(elty,3*i,2*i) for i in 1:10]
bB = [rand(elty,2*i,5*i) for i in 1:10]
bC = [rand(elty,3*i,5*i) for i in 1:10]
# move to device
bd_A = CuArray{elty, 2}[]
bd_B = CuArray{elty, 2}[]
bd_C = CuArray{elty, 2}[]
```

Suggested change:

```diff
-bA = [rand(elty,3*i,2*i) for i in 1:10]
-bB = [rand(elty,2*i,5*i) for i in 1:10]
-bC = [rand(elty,3*i,5*i) for i in 1:10]
-# move to device
-bd_A = CuArray{elty, 2}[]
-bd_B = CuArray{elty, 2}[]
-bd_C = CuArray{elty, 2}[]
+bA = [rand(elty, 3 * i, 2 * i) for i in 1:10]
+bB = [rand(elty, 2 * i, 5 * i) for i in 1:10]
+bC = [rand(elty, 3 * i, 5 * i) for i in 1:10]
+push!(bd_A, CuArray(bA[i]))
+push!(bd_B, CuArray(bB[i]))
+push!(bd_C, CuArray(bC[i]))
+CUBLAS.gemm_grouped_batched!(transA, transB, alpha, bd_A, bd_B, beta, bd_C)
```
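The batch shapes above follow a simple pattern that can be checked on the CPU without CUBLAS; a small sketch (no GPU involved, plain `*` in place of the batched kernel):

```julia
# Batch i multiplies a (3i×2i) matrix by a (2i×5i) matrix into a (3i×5i) result,
# mirroring the shapes built for gemm_grouped_batched!.
bA = [rand(3i, 2i) for i in 1:10]
bB = [rand(2i, 5i) for i in 1:10]
bC = [bA[i] * bB[i] for i in 1:10]
@assert all(size(bC[i]) == (3i, 5i) for i in 1:10)
```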
```julia
end

@testset "gemm_grouped_batched" begin
bd_C = CUBLAS.gemm_grouped_batched(transA,transB,bd_A,bd_B)
```

Suggested change:

```diff
-bd_C = CUBLAS.gemm_grouped_batched(transA,transB,bd_A,bd_B)
+bd_C = CUBLAS.gemm_grouped_batched(transA, transB, bd_A, bd_B)
```
```julia
m,k,n = 4,4,4
cudaTypes = (Float16, Complex{Float16}, BFloat16, Complex{BFloat16}, Float32, Complex{Float32},
             Float64, Complex{Float64}, Int8, Complex{Int8}, UInt8, Complex{UInt8},
             Int16, Complex{Int16}, UInt16, Complex{UInt16}, Int32, Complex{Int32},
             UInt32, Complex{UInt32}, Int64, Complex{Int64}, UInt64, Complex{UInt64})

for AT in cudaTypes, CT in cudaTypes
    BT = AT # gemmEx requires identical A and B types
```

Suggested change:

```diff
-m,k,n = 4,4,4
-cudaTypes = (Float16, Complex{Float16}, BFloat16, Complex{BFloat16}, Float32, Complex{Float32},
-             Float64, Complex{Float64}, Int8, Complex{Int8}, UInt8, Complex{UInt8},
-             Int16, Complex{Int16}, UInt16, Complex{UInt16}, Int32, Complex{Int32},
-             UInt32, Complex{UInt32}, Int64, Complex{Int64}, UInt64, Complex{UInt64})
-for AT in cudaTypes, CT in cudaTypes
-    BT = AT # gemmEx requires identical A and B types
+m, k, n = 4, 4, 4
+cudaTypes = (
+    Float16, Complex{Float16}, BFloat16, Complex{BFloat16}, Float32, Complex{Float32},
+    Float64, Complex{Float64}, Int8, Complex{Int8}, UInt8, Complex{UInt8},
+    Int16, Complex{Int16}, UInt16, Complex{UInt16}, Int32, Complex{Int32},
+    UInt32, Complex{UInt32}, Int64, Complex{Int64}, UInt64, Complex{UInt64},
+)
+if CUBLAS.gemmExComputeType(AT, BT, CT, m, k, n) !== nothing
+    A = AT <: BFloat16 ? AT.(rand(m, k)) : rand(AT, m, k)
+    B = BT <: BFloat16 ? BT.(rand(k, n)) : rand(BT, k, n)
```
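The mixed-precision check pattern in this testset can be sketched on the CPU: compute a reference in the wider output type and compare with a tolerance driven by the narrowest input type. A hedged sketch (the type combination is illustrative; `gemmExComputeType` and the GPU path are not exercised here):

```julia
# CPU sketch of the gemmEx-style mixed-precision comparison.
m, k, n = 4, 4, 4
AT, BT, CT = Float16, Float16, Float32   # illustrative; gemmEx requires matching A and B types
A = rand(AT, m, k); B = rand(BT, k, n)
reference = CT.(A) * CT.(B)              # reference computed in the output type
rtol = Base.rtoldefault(AT)              # tolerance from the narrowest input type
@assert isapprox(reference, CT.(A * B); rtol = rtol)
```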
```julia
@test C ≈ Array(dC) rtol=rtol
end
end

# also test an unsupported combination (falling back to GPUArrays)
if VERSION < v"1.11-" # JuliaGPU/CUDA.jl#2441
```

Suggested change:

```diff
-@test C ≈ Array(dC) rtol=rtol
-end
-end
-# also test an unsupported combination (falling back to GPUArrays)
-if VERSION < v"1.11-" # JuliaGPU/CUDA.jl#2441
+@test C ≈ Array(dC) rtol = rtol
+AT = BFloat16
+BT = Int32
+CT = Float64
+A = AT.(rand(m, k))
+B = rand(BT, k, n)
```
```julia
@test C ≈ Array(dC) rtol=rtol
end
end
```

Suggested change:

```diff
-@test C ≈ Array(dC) rtol=rtol
-end
-end
+@test C ≈ Array(dC) rtol = rtol
+testf(randn(784 * 100), rand(Float32, 784, 100)) do p, x
+    p[reshape(1:(out * inn), out, inn)] * x
+    @view(p[reshape(1:(out * inn), out, inn)]) * x
```
Testing locally, the level 3 tests and the split-out level 3 GEMM tests take about the same total amount of time, so this should help with parallelization. Also removed an extraneous comment.