Performance regression of tight loops on tuples between 1.9.0-DEV.1180 and 1.10.0-DEV.281 #48240

Open · Seelengrab opened this issue Jan 11, 2023 · 0 comments
Labels: performance (Must go faster), regression (Regression in behavior compared to a previous version)

julia> @benchmark v4!(r, d, n) setup=(n=4000;r=zeros(Float32,n,n);d=rand(Float32,n,n);d[1:(n+1):(n*n)].=0f0) evals=1 seconds=60
BenchmarkTools.Trial: 15 samples with 1 evaluation.
 Range (min … max):  4.070 s …  4.434 s  ┊ GC (min … max): 0.03% … 0.03%
 Time  (median):     4.169 s              ┊ GC (median):    0.03%
 Time  (mean ± σ):   4.185 s ± 95.457 ms  ┊ GC (mean ± σ):  0.03% ± 0.00%

  ▁ ▁▁▁    █   ▁ ▁  ▁   ▁▁ ▁    ▁ ▁                       ▁  
  █▁███▁▁▁▁█▁▁▁█▁█▁▁█▁▁▁██▁█▁▁▁▁█▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  4.07 s         Histogram: frequency by time        4.43 s <

 Memory estimate: 122.14 MiB, allocs estimate: 56.

julia> versioninfo()
Julia Version 1.10.0-DEV.281
Commit 77fa4cb175* (2023-01-05 12:47 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 4 × Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 4 on 4 virtual cores
Environment:
  JULIA_NUM_THREADS = 4

vs.

julia> @benchmark v4!(r, d, n) setup=(n=4000;r=zeros(Float32,n,n);d=rand(Float32,n,n);d[1:(n+1):(n*n)].=0f0) evals=1 seconds=60
BenchmarkTools.Trial: 18 samples with 1 evaluation.
 Range (min … max):  3.308 s …   3.936 s  ┊ GC (min … max): 0.02% … 0.01%
 Time  (median):     3.410 s               ┊ GC (median):    0.02%
 Time  (mean ± σ):   3.477 s ± 167.066 ms  ┊ GC (mean ± σ):  0.02% ± 0.00%

        ▃ █      ▃                                            
  ▇▇▇▁▇▁█▁█▇▁▁▇▁▁█▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇▁▁▇▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇ ▁
  3.31 s         Histogram: frequency by time         3.94 s <

 Memory estimate: 122.14 MiB, allocs estimate: 55.

julia> versioninfo()
Julia Version 1.9.0-DEV.1180
Commit 36aab14a97* (2022-08-25 10:17 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 4 × Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.5 (ORCJIT, skylake)
  Threads: 4 on 4 virtual cores
Environment:
  JULIA_NUM_THREADS = 4

The code is:

using Base.Threads
using Base.Cartesian

function v4!(res, data_, n)
    @boundscheck checkbounds(res, 1:n)
    @boundscheck checkbounds(data_, 1:n)
    vectorwidth = 8
    nvectors = div(n + vectorwidth - 1, vectorwidth)
    colsize = nvectors*vectorwidth # number of rows

    blockwidth = 3
    nblocks = div(n + blockwidth - 1, blockwidth)
    rowsize = nblocks*blockwidth # number of columns

    data = Matrix{Float32}(undef, colsize, rowsize)
    tran = Matrix{Float32}(undef, colsize, rowsize)
    Threads.@threads for row in 1:rowsize
        @inbounds for col in 1:n
            data[col,row] = row <= n ? data_[col,row] : Inf32
            tran[col,row] = row <= n ? data_[row,col] : Inf32
        end
    end
    @inbounds for row in 1:rowsize
        for col in n+1:colsize
            data[col,row] = Inf32
            tran[col,row] = Inf32
        end
    end

    Threads.@threads for row in 1:blockwidth:n
        @inbounds for col in 1:blockwidth:n
            # v_1_1_1, v_1_1_2, ..., v_2_1_1, ..., v_3_3_8
            @nexprs 3 l -> begin
            @nexprs 3 k -> begin
                v_k_l = @ntuple 8 _ -> Inf32
            end
            end

            for block in 1:vectorwidth:colsize
                # x_1_1, ..., x_3_8
                # y_1_1, ..., y_3_8
                @nexprs 3 k -> y_k = @ntuple 8 i -> tran[block+(i-1), row+(k-1)]
                @nexprs 3 k -> x_k = @ntuple 8 i -> data[block+(i-1), col+(k-1)]

                @nexprs 3 k -> begin
                @nexprs 3 l -> begin
                    z = x_k .+ y_l
                    v_k_l = @fastmath min.(v_k_l, z)
                end
                end
            end
            
            @nexprs 3 l -> begin
            @nexprs 3 k -> begin
            out_row_l = row+(l-1)
            out_col_k = col+(k-1)
            if out_row_l <= n && out_col_k <= n
                res[out_col_k, out_row_l] = @fastmath min(v_k_l...)
            end
            end
            end
        end
    end
end
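
For reference, the innermost pattern that the Base.Cartesian macros expand to is an ordinary tuple loop. The reduced kernel below is my own sketch (the name tuple_kernel and the standalone form are not part of the benchmark above); it isolates, for a single (k, l) pair, the tuple broadcasts that the issue title refers to:

function tuple_kernel(data::Matrix{Float32}, tran::Matrix{Float32}, row::Int, col::Int)
    # roughly what `@ntuple 8 _ -> Inf32` produces for one accumulator
    v = ntuple(_ -> Inf32, 8)
    @inbounds for block in 1:8:size(data, 1)
        # roughly `@ntuple 8 i -> tran[block+(i-1), row]` and the data equivalent
        y = ntuple(i -> tran[block + (i - 1), row], 8)
        x = ntuple(i -> data[block + (i - 1), col], 8)
        # tuple broadcasts; these should stay in registers and vectorize
        v = min.(v, x .+ y)
    end
    return @fastmath min(v...)
end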

I haven't looked into it in detail yet, but I suspect the difference comes from a change in reduction behavior. If the code were taking full advantage of the hardware it should run in ~2.2 s on my machine, so something else is amiss as well, but the slowdown above is the regression I observed during development. I'll try to dig into this a bit tomorrow by comparing the generated LLVM IR and assembly between the two versions.
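
One way to do that comparison (a sketch, assuming the reduced tuple_kernel above as the target; v4! itself works too but produces much longer output) is to dump the optimized IR and assembly to files with the standard InteractiveUtils functions and diff them between the two builds:

using InteractiveUtils

# Method signature of the reduced kernel defined above.
sig = Tuple{Matrix{Float32}, Matrix{Float32}, Int, Int}

# Write the optimized LLVM IR and native assembly for the current Julia build;
# running this under 1.9.0-DEV.1180 and 1.10.0-DEV.281 and diffing the files
# should show whether the tuple min/add loop still vectorizes the same way.
open("tuple_kernel_$(VERSION).ll", "w") do io
    code_llvm(io, tuple_kernel, sig; debuginfo=:none)
end
open("tuple_kernel_$(VERSION).s", "w") do io
    code_native(io, tuple_kernel, sig; debuginfo=:none)
end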
