julia> @benchmark v4!(r, d, n) setup=(n=4000; r=zeros(Float32,n,n); d=rand(Float32,n,n); d[1:(n+1):(n*n)] .= 0f0) evals=1 seconds=60
BenchmarkTools.Trial: 15 samples with 1 evaluation.
 Range (min … max):  4.070 s … 4.434 s  ┊ GC (min … max): 0.03% … 0.03%
 Time  (median):     4.169 s            ┊ GC (median):    0.03%
 Time  (mean ± σ):   4.185 s ± 95.457 ms ┊ GC (mean ± σ):  0.03% ± 0.00%
▁ ▁▁▁ █ ▁ ▁ ▁ ▁▁ ▁ ▁ ▁ ▁
█▁███▁▁▁▁█▁▁▁█▁█▁▁█▁▁▁██▁█▁▁▁▁█▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
4.07 s Histogram: frequency by time 4.43 s <
 Memory estimate: 122.14 MiB, allocs estimate: 56.
julia> versioninfo()
Julia Version 1.10.0-DEV.281
Commit 77fa4cb175* (2023-01-05 12:47 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 4 × Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 4 on 4 virtual cores
Environment:
  JULIA_NUM_THREADS = 4
vs.
julia> @benchmark v4!(r, d, n) setup=(n=4000; r=zeros(Float32,n,n); d=rand(Float32,n,n); d[1:(n+1):(n*n)] .= 0f0) evals=1 seconds=60
BenchmarkTools.Trial: 18 samples with 1 evaluation.
 Range (min … max):  3.308 s … 3.936 s  ┊ GC (min … max): 0.02% … 0.01%
 Time  (median):     3.410 s            ┊ GC (median):    0.02%
 Time  (mean ± σ):   3.477 s ± 167.066 ms ┊ GC (mean ± σ):  0.02% ± 0.00%
▃ █ ▃
▇▇▇▁▇▁█▁█▇▁▁▇▁▁█▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇▁▁▇▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇ ▁
3.31 s Histogram: frequency by time 3.94 s <
 Memory estimate: 122.14 MiB, allocs estimate: 55.
julia> versioninfo()
Julia Version 1.9.0-DEV.1180
Commit 36aab14a97* (2022-08-25 10:17 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 4 × Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.5 (ORCJIT, skylake)
  Threads: 4 on 4 virtual cores
Environment:
  JULIA_NUM_THREADS = 4
The code is
Code
using Base.Threads
using Base.Cartesian

function v4!(res, data_, n)
    @boundscheck checkbounds(res, 1:n)
    @boundscheck checkbounds(data_, 1:n)
    vectorwidth = 8
    nvectors = div(n + vectorwidth - 1, vectorwidth)
    colsize = nvectors*vectorwidth # number of rows
    blockwidth = 3
    nblocks = div(n + blockwidth - 1, blockwidth)
    rowsize = nblocks*blockwidth # number of columns
    data = Matrix{Float32}(undef, colsize, rowsize)
    tran = Matrix{Float32}(undef, colsize, rowsize)
    Threads.@threads for row in 1:rowsize
        @inbounds for col in 1:n
            data[col,row] = row <= n ? data_[col,row] : Inf32
            tran[col,row] = row <= n ? data_[row,col] : Inf32
        end
    end
    @inbounds for row in 1:rowsize
        for col in n+1:colsize
            data[col,row] = Inf32
            tran[col,row] = Inf32
        end
    end
    Threads.@threads for row in 1:blockwidth:n
        @inbounds for col in 1:blockwidth:n
            # v_1_1_1, v_1_1_2, ..., v_2_1_1, ..., v_3_3_8
            @nexprs 3 l -> begin
                @nexprs 3 k -> begin
                    v_k_l = @ntuple 8 _ -> Inf32
                end
            end
            for block in 1:vectorwidth:colsize
                # x_1_1, ..., x_3_8
                # y_1_1, ..., y_3_8
                @nexprs 3 k -> y_k = @ntuple 8 i -> tran[block+(i-1), row+(k-1)]
                @nexprs 3 k -> x_k = @ntuple 8 i -> data[block+(i-1), col+(k-1)]
                @nexprs 3 k -> begin
                    @nexprs 3 l -> begin
                        z = x_k .+ y_l
                        v_k_l = @fastmath min.(v_k_l, z)
                    end
                end
            end
            @nexprs 3 l -> begin
                @nexprs 3 k -> begin
                    out_row_l = row+(l-1)
                    out_col_k = col+(k-1)
                    if out_row_l <= n && out_col_k <= n
                        res[out_col_k, out_row_l] = @fastmath min(v_k_l...)
                    end
                end
            end
        end
    end
end
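For anyone wanting to check results while comparing versions: reading the kernel above, each output entry appears to be `res[c, r] = min over i of data_[i, c] + data_[r, i]`. Below is a minimal, unoptimized sketch of that reduction, useful as a reference on small inputs. The name `naive_minplus!` is made up for this illustration and is not part of the original code.

```julia
# Naive reference for the reduction v4! appears to compute:
#   res[j, i] = min over k of d[k, j] + d[i, k]
# `naive_minplus!` is a hypothetical helper name used only for illustration.
function naive_minplus!(res, d, n)
    for i in 1:n, j in 1:n
        best = Inf32
        for k in 1:n
            best = min(best, d[k, j] + d[i, k])
        end
        res[j, i] = best
    end
    return res
end

# Small input built the same way as in the benchmark setup (zero diagonal).
n = 5
d = rand(Float32, n, n)
d[1:(n+1):(n*n)] .= 0f0
expected = naive_minplus!(zeros(Float32, n, n), d, n)
```

On a small `n`, `expected` can be compared against the output of `v4!` with `isapprox`; since `v4!` uses `@fastmath`, exact equality is not guaranteed.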
I haven't looked into it in detail yet, but I suspect some difference due to a change in reduction behavior. The code (if it were taking advantage of everything it should) should run in ~2.2s on my machine, so something else is amiss too, but this is the regression I observed during development. I'll try to dig into this a bit tomorrow, looking into differences in generated LLVM IR and assembly.
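One possible way to do that comparison is to dump the IR and assembly to files tagged with the Julia version and diff them across the two builds. This is only a sketch: `kernel` is a small stand-in that mirrors the inner `min.`/`.+` step on 8-wide tuples, and the output directory and file names are arbitrary choices. For the real investigation, `v4!` and its argument types would be substituted.

```julia
using InteractiveUtils  # provides code_llvm and code_native

# Hypothetical stand-in for the inner step of v4!; not the full function.
function kernel(v, x, y)
    z = x .+ y
    return @fastmath min.(v, z)
end

T = NTuple{8,Float32}
argtypes = (T, T, T)

# Tag the output files with the running Julia version so that dumps
# from two builds can be diffed side by side.
tag = "julia-$(VERSION)"
outdir = mktempdir()
llpath = joinpath(outdir, "$tag.ll")
asmpath = joinpath(outdir, "$tag.s")

open(llpath, "w") do io
    code_llvm(io, kernel, argtypes; debuginfo=:none)
end
open(asmpath, "w") do io
    code_native(io, kernel, argtypes; debuginfo=:none)
end
```

Running the same script under both builds and diffing the resulting `.ll` and `.s` files should show whether the reduction is lowered differently.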