Performance regression of tight loops on tuples between 1.9.0-DEV.1180 and 1.10.0-DEV.281 #48240

Open · Seelengrab opened this issue Jan 11, 2023 · 0 comments
Labels: performance (Must go faster), regression (Regression in behavior compared to a previous version)

julia> @benchmark v4!(r, d, n) setup=(n=4000;r=zeros(Float32,n,n);d=rand(Float32,n,n);d[1:(n+1):(n*n)].=0f0) evals=1 seconds=60
BenchmarkTools.Trial: 15 samples with 1 evaluation.
 Range (min … max):  4.070 s …  4.434 s  ┊ GC (min … max): 0.03% … 0.03%
 Time  (median):     4.169 s              ┊ GC (median):    0.03%
 Time  (mean ± σ):   4.185 s ± 95.457 ms  ┊ GC (mean ± σ):  0.03% ± 0.00%

  ▁ ▁▁▁    █   ▁ ▁  ▁   ▁▁ ▁    ▁ ▁                       ▁  
  █▁███▁▁▁▁█▁▁▁█▁█▁▁█▁▁▁██▁█▁▁▁▁█▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  4.07 s         Histogram: frequency by time        4.43 s <

 Memory estimate: 122.14 MiB, allocs estimate: 56.

julia> versioninfo()
Julia Version 1.10.0-DEV.281
Commit 77fa4cb175* (2023-01-05 12:47 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 4 × Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 4 on 4 virtual cores
Environment:
  JULIA_NUM_THREADS = 4

vs.

julia> @benchmark v4!(r, d, n) setup=(n=4000;r=zeros(Float32,n,n);d=rand(Float32,n,n);d[1:(n+1):(n*n)].=0f0) evals=1 seconds=60
BenchmarkTools.Trial: 18 samples with 1 evaluation.
 Range (min … max):  3.308 s …   3.936 s  ┊ GC (min … max): 0.02% … 0.01%
 Time  (median):     3.410 s               ┊ GC (median):    0.02%
 Time  (mean ± σ):   3.477 s ± 167.066 ms  ┊ GC (mean ± σ):  0.02% ± 0.00%

        ▃ █      ▃                                            
  ▇▇▇▁▇▁█▁█▇▁▁▇▁▁█▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇▁▁▇▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇ ▁
  3.31 s         Histogram: frequency by time         3.94 s <

 Memory estimate: 122.14 MiB, allocs estimate: 55.

julia> versioninfo()
Julia Version 1.9.0-DEV.1180
Commit 36aab14a97* (2022-08-25 10:17 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 4 × Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.5 (ORCJIT, skylake)
  Threads: 4 on 4 virtual cores
Environment:
  JULIA_NUM_THREADS = 4

The code is:

using Base.Threads
using Base.Cartesian

function v4!(res, data_, n)
    @boundscheck checkbounds(res, 1:n)
    @boundscheck checkbounds(data_, 1:n)
    vectorwidth = 8
    nvectors = div(n + vectorwidth - 1, vectorwidth)
    colsize = nvectors*vectorwidth # number of rows

    blockwidth = 3
    nblocks = div(n + blockwidth - 1, blockwidth)
    rowsize = nblocks*blockwidth # number of columns

    data = Matrix{Float32}(undef, colsize, rowsize)
    tran = Matrix{Float32}(undef, colsize, rowsize)
    Threads.@threads for row in 1:rowsize
        @inbounds for col in 1:n
            data[col,row] = row <= n ? data_[col,row] : Inf32
            tran[col,row] = row <= n ? data_[row,col] : Inf32
        end
    end
    @inbounds for row in 1:rowsize
        for col in n+1:colsize
            data[col,row] = Inf32
            tran[col,row] = Inf32
        end
    end

    Threads.@threads for row in 1:blockwidth:n
        @inbounds for col in 1:blockwidth:n
            # v_1_1_1, v_1_1_2, ..., v_2_1_1, ..., v_3_3_8
            @nexprs 3 l -> begin
            @nexprs 3 k -> begin
                v_k_l = @ntuple 8 _ -> Inf32
            end
            end

            for block in 1:vectorwidth:colsize
                # x_1_1, ..., x_3_8
                # y_1_1, ..., y_3_8
                @nexprs 3 k -> y_k = @ntuple 8 i -> tran[block+(i-1), row+(k-1)]
                @nexprs 3 k -> x_k = @ntuple 8 i -> data[block+(i-1), col+(k-1)]

                @nexprs 3 k -> begin
                @nexprs 3 l -> begin
                    z = x_k .+ y_l
                    v_k_l = @fastmath min.(v_k_l, z)
                end
                end
            end
            
            @nexprs 3 l -> begin
            @nexprs 3 k -> begin
            out_row_l = row+(l-1)
            out_col_k = col+(k-1)
            if out_row_l <= n && out_col_k <= n
                res[out_col_k, out_row_l] = @fastmath min(v_k_l...)
            end
            end
            end
        end
    end
end
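
For reference, the innermost pattern that the Base.Cartesian macros expand to is an ordinary tuple loop. The reduced kernel below is my own sketch (the name tuple_kernel and the standalone form are not part of the benchmark above); it isolates, for a single (k, l) pair, the tuple broadcasts that the issue title refers to:

function tuple_kernel(data::Matrix{Float32}, tran::Matrix{Float32}, row::Int, col::Int)
    # roughly what `@ntuple 8 _ -> Inf32` produces for one accumulator
    v = ntuple(_ -> Inf32, 8)
    @inbounds for block in 1:8:size(data, 1)
        # roughly `@ntuple 8 i -> tran[block+(i-1), row]` and the data equivalent
        y = ntuple(i -> tran[block + (i - 1), row], 8)
        x = ntuple(i -> data[block + (i - 1), col], 8)
        # tuple broadcasts; these should stay in registers and vectorize
        v = min.(v, x .+ y)
    end
    return @fastmath min(v...)
end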

I haven't looked into it in detail yet, but I suspect the difference comes from a change in reduction behavior. If the code were taking full advantage of the hardware it should run in ~2.2 s on my machine, so something else is amiss as well, but the slowdown above is the regression I observed during development. I'll try to dig into this a bit tomorrow by comparing the generated LLVM IR and assembly between the two versions.
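
One way to do that comparison (a sketch, assuming the reduced tuple_kernel above as the target; v4! itself works too but produces much longer output) is to dump the optimized IR and assembly to files with the standard InteractiveUtils functions and diff them between the two builds:

using InteractiveUtils

# Method signature of the reduced kernel defined above.
sig = Tuple{Matrix{Float32}, Matrix{Float32}, Int, Int}

# Write the optimized LLVM IR and native assembly for the current Julia build;
# running this under 1.9.0-DEV.1180 and 1.10.0-DEV.281 and diffing the files
# should show whether the tuple min/add loop still vectorizes the same way.
open("tuple_kernel_$(VERSION).ll", "w") do io
    code_llvm(io, tuple_kernel, sig; debuginfo=:none)
end
open("tuple_kernel_$(VERSION).s", "w") do io
    code_native(io, tuple_kernel, sig; debuginfo=:none)
end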
