The AllocOptPass is potentially slow (and runs too often) in some cases #54524

KristofferC · 2024-05-20T15:51:11Z

Looking a bit into why #54520 caused such a big latency improvement, the issue (or at least one issue) is that with the broadcast code a large majority of the time is spent in our own alloc optimization pass:

As can be seen, this pass runs four times, each time taking quite a long time. The time spent in this pass almost completely disappears when the broadcasting code is replaced with a loop (as was done in #54520).

It might be possible with some latency gains here if one can:

Figure out the reason this pass seems to have such a problem with the broadcasting code
Understanding if it really is required to run the pass four times.

maleadt · 2024-05-21T11:56:42Z

FWIW, the code generated by LinearAlgebra.exp is pretty large:

julia> write("file.ll", sprint(io->code_llvm(io, LinearAlgebra.exp!, Tuple{Array{Float64, 2}};
             optimize=false, debuginfo=:none, dump_module=true)))

shell> ls -lah *.ll
-rw-r--r-- 1 tim tim 529K May 21 13:51 1.10.ll
-rw-r--r-- 1 tim tim 1.7M May 21 13:54 1.11.ll
-rw-r--r-- 1 tim tim 1.8M May 21 13:53 1.12-broadcast.ll
-rw-r--r-- 1 tim tim 548K May 21 13:53 1.12-loop.ll

KristofferC · 2024-05-21T12:46:05Z

Yes, #43200 makes it better:

-rw-r--r--  1 kristoffercarlsson  staff   1.8M May 21 14:44 exp_1.11.ll
-rw-r--r--  1 kristoffercarlsson  staff   491K May 21 14:36 exp_1.11_noinline.ll

but it doesn't seem to impact compilation time much:

# current
julia> @time @eval LinearAlgebra.exp!(rand(2,2))
  1.988244 seconds (8.59 M allocations: 422.911 MiB, 4.09% gc time, 99.49% compilation time)

# no @inline
julia> @time @eval LinearAlgebra.exp!(rand(2,2))
  2.040852 seconds (7.57 M allocations: 370.544 MiB, 3.47% gc time, 99.48% compilation time)

KristofferC · 2024-05-21T15:30:05Z

I guss a question is, is it reasonable for

function f(A::StridedMatrix{T}) where T
    CC = T[64764752532480000.,32382376266240000.,7771770303897600.,
           1187353796428800.,  129060195264000.,  10559470521600.,
            670442572800.,      33522128640.,      1323241920.,
            40840800.,           960960.,           16380.,
            182.,                1.]

    tmp1, tmp2 = similar(A), similar(A)

    tmp1 .= CC[14].*A .+ CC[12].*A .+ CC[10].*A
    tmp2 .= CC[8].*A .+ CC[6].*A .+ CC[4].*A

    tmp1 .= CC[13].*A .+ CC[11].*A .+ CC[9].*A
    tmp2 .= CC[7].*A .+ CC[5].*A .+ CC[3].*A

    return tmp1, tmp2
end

write("file.ll", sprint(io->code_llvm(io, f, Tuple{Array{Float64, 2}};
             optimize=false, debuginfo=:none, dump_module=true)))

to expand to 13k lines of LLVM IR+

KristofferC added performance Must go faster compiler:codegen Generation of LLVM IR and native code broadcast Applying a function over a collection labels May 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

The AllocOptPass is potentially slow (and runs too often) in some cases #54524

The AllocOptPass is potentially slow (and runs too often) in some cases #54524

KristofferC commented May 20, 2024 •

edited

Loading

maleadt commented May 21, 2024

KristofferC commented May 21, 2024

KristofferC commented May 21, 2024

The AllocOptPass is potentially slow (and runs too often) in some cases #54524

The AllocOptPass is potentially slow (and runs too often) in some cases #54524

Comments

KristofferC commented May 20, 2024 • edited Loading

maleadt commented May 21, 2024

KristofferC commented May 21, 2024

KristofferC commented May 21, 2024

KristofferC commented May 20, 2024 •

edited

Loading