
The AllocOptPass is potentially slow (and runs too often) in some cases #54524

Open
KristofferC opened this issue May 20, 2024 · 3 comments
Labels
broadcast Applying a function over a collection compiler:codegen Generation of LLVM IR and native code performance Must go faster

Comments

@KristofferC
Member

KristofferC commented May 20, 2024

Looking a bit into why #54520 caused such a big latency improvement, the issue (or at least one issue) is that with the broadcast code a large majority of the time is spent in our own alloc optimization pass:

[profiler screenshots: per-pass compilation time, dominated by several runs of AllocOptPass]

As can be seen, this pass runs four times, each time taking quite a long time. The time spent in this pass almost completely disappears when the broadcasting code is replaced with a loop (as was done in #54520).

It might be possible to get some latency gains here if one can:

  • Figure out why this pass has such a problem with the broadcasting code
  • Determine whether it is really necessary to run the pass four times.
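For context, the rewrite in #54520 is of this general shape: a fused broadcast replaced by an explicit loop. A minimal sketch of the pattern (the function names and the axpy-style kernel are hypothetical, chosen only to illustrate the rewrite, not code from the PR):

```julia
# Hypothetical illustration of the broadcast-to-loop rewrite.
# The broadcast version routes through the Base.Broadcast machinery,
# which can expand to far more LLVM IR than the plain loop.
axpy_broadcast!(Y, a, X) = (Y .= a .* X .+ Y)

function axpy_loop!(Y, a, X)
    @inbounds for i in eachindex(Y, X)
        Y[i] = muladd(a, X[i], Y[i])  # same computation, written as a loop
    end
    return Y
end
```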
@KristofferC KristofferC added performance Must go faster compiler:codegen Generation of LLVM IR and native code broadcast Applying a function over a collection labels May 20, 2024
@maleadt
Member

maleadt commented May 21, 2024

FWIW, the code generated by LinearAlgebra.exp is pretty large:

julia> write("file.ll", sprint(io->code_llvm(io, LinearAlgebra.exp!, Tuple{Array{Float64, 2}};
             optimize=false, debuginfo=:none, dump_module=true)))

shell> ls -lah *.ll
-rw-r--r-- 1 tim tim 529K May 21 13:51 1.10.ll
-rw-r--r-- 1 tim tim 1.7M May 21 13:54 1.11.ll
-rw-r--r-- 1 tim tim 1.8M May 21 13:53 1.12-broadcast.ll
-rw-r--r-- 1 tim tim 548K May 21 13:53 1.12-loop.ll
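One way to reproduce size comparisons like this without writing files is to count the lines of IR directly; a sketch (the `ir_lines` helper is hypothetical):

```julia
using InteractiveUtils, LinearAlgebra

# Hypothetical helper: lines of unoptimized LLVM IR for a given method.
ir_lines(f, types) = countlines(IOBuffer(sprint(io ->
    code_llvm(io, f, types; optimize=false, debuginfo=:none, dump_module=true))))

ir_lines(LinearAlgebra.exp!, Tuple{Matrix{Float64}})
```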

@KristofferC
Member Author

Yes, #43200 makes it better:

-rw-r--r--  1 kristoffercarlsson  staff   1.8M May 21 14:44 exp_1.11.ll
-rw-r--r--  1 kristoffercarlsson  staff   491K May 21 14:36 exp_1.11_noinline.ll

but it doesn't seem to impact compilation time much:

# current
julia> @time @eval LinearAlgebra.exp!(rand(2,2))
  1.988244 seconds (8.59 M allocations: 422.911 MiB, 4.09% gc time, 99.49% compilation time)

# no @inline
julia> @time @eval LinearAlgebra.exp!(rand(2,2))
  2.040852 seconds (7.57 M allocations: 370.544 MiB, 3.47% gc time, 99.48% compilation time)

@KristofferC
Member Author

I guess a question is: is it reasonable for

function f(A::StridedMatrix{T}) where T
    CC = T[64764752532480000.,32382376266240000.,7771770303897600.,
           1187353796428800.,  129060195264000.,  10559470521600.,
            670442572800.,      33522128640.,      1323241920.,
            40840800.,           960960.,           16380.,
            182.,                1.]

    tmp1, tmp2 = similar(A), similar(A)

    tmp1 .= CC[14].*A .+ CC[12].*A .+ CC[10].*A
    tmp2 .= CC[8].*A .+ CC[6].*A .+ CC[4].*A

    tmp1 .= CC[13].*A .+ CC[11].*A .+ CC[9].*A
    tmp2 .= CC[7].*A .+ CC[5].*A .+ CC[3].*A

    return tmp1, tmp2
end

write("file.ll", sprint(io->code_llvm(io, f, Tuple{Array{Float64, 2}};
             optimize=false, debuginfo=:none, dump_module=true)))

to expand to 13k lines of LLVM IR?
