Much more aggressive alloc_elim_pass! #23240
Conversation
Force-pushed from 6a4b19f to 5887be5.
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan
Force-pushed from 72fabd3 to a0752a8.
I'm quite surprised that the first version didn't crash nanosoldier.... Anyway, try again now with a few (correctness and performance) bugs fixed.
Force-pushed from a0752a8 to a969e53.
Force-pushed from a969e53 to 364f576.
Force-pushed from 364f576 to d66a217.
Hmmmmm, I swear I copied the nanosoldier line to my comment above but somehow it isn't there. Anyway, good that I checked.... So try again.... @nanosoldier
```julia
elseif head === :call
    fld = expr.args[3]
    if is_known_call(expr, isdefined, ctx.sv.src, ctx.sv.mod)
        use_kind = 0
```
Should have names for these constants.
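A minimal sketch of what that could look like (the constant names and the meaning of the values are guesses for illustration, not taken from the PR):

```julia
# Hypothetical names for the magic use_kind values; the real semantics are
# defined by the pass, so treat these purely as placeholders.
const USE_KIND_ISDEFINED = 0  # the use only checks whether the value is assigned
const USE_KIND_VALUE     = 1  # the use reads the value itself

# ...the branch above would then read:
# use_kind = USE_KIND_ISDEFINED
```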
I assume it would work to enable … Since this replaces a lot of the code anyway, this seems like a good opportunity to split inference.jl into two files: one for inference and one for other optimization passes.
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan
Force-pushed from d66a217 to 435950c.
@nanosoldier
Force-pushed from 2bf0ce3 to 3d9ee57.
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan
I've randomly picked a regressed benchmark and it seems to be due to the inference test failure that's caused by the linear IR.
Sure... Though in this case these are strictly function local (and only on a single variable...)
This does need a particular flavor to work most efficiently (i.e. splitting out the undef check). Unless progress on making the other parts ready is too slow, I do plan to merge this after we are fully ready to flip that switch.
At least not now, to reduce conflicts. I'm also not a fan of un-namespaced internal API crossing multiple files....
The inference test should be fixed. It's in its own commit since I'm not sure if this is the right fix (it looks safe and it makes sense that this works, but I'm not really sure why it worked previously.....). Check performance again. @nanosoldier
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan
Force-pushed from 5d21c6a to 828e60e.
@nanosoldier
Force-pushed from 95157ed to 37fe43d.
base/inference.jl (outdated)
```julia
lhs = stmt.args[1]
if isa(lhs, SSAValue) || isa(lhs, Slot)
    add_def(ctx.infomap, lhs, ValueDef(stmt, stmts, 1))
```
`(:symbol,)` is also valid
base/inference.jl (outdated)
```julia
lhs = stmt.args[1]
if isa(lhs, SSAValue) || isa(lhs, Slot)
    add_def(ctx.infomap, lhs, ValueDef(stmt, stmts, 1))
```
this will be inferred, and isn't necessary
base/inference.jl (outdated)
```julia
lhs = stmt.args[1]
if isa(lhs, SSAValue) || isa(lhs, Slot)
    add_def(ctx.infomap, lhs, ValueDef(stmt, stmts, 1))
```
3:5 are metadata and shouldn't need to be linearized
@nanosoldier
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan
Force-pushed from b2a095a to 9fdc6f6.
The performance issue is caused by … Also did some cleanup of the commits and wrote proper commit messages. It probably still needs cleanup. P.S. GitHub seems to have messed up and doesn't want to show me which lines of code the three comments from @vtjnash are about. They all link to the same code and none of them applies there AFAICT. I can only guess the ccall one applies to the ccall linearization case. Let's check again.... @nanosoldier
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan
Force-pushed from 9fdc6f6 to 157a930.
Updated to also handle … @nanosoldier
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan
Force-pushed from 157a930 to 11d6888.
Force-pushed from 11d6888 to 7cc0d9f.
These objects make it really hard to mutate the AST correctly, since a mutation can accidentally be done in places where it is invalid.
In particular, the feature we'd like to take advantage of is mostly the limited variety of uses. This means we can easily:

1. keep track of the expressions in which a value is used,
2. replace a use, and
3. update the use info without extensive rescanning.

This also limits the kinds of side effects an expression's arguments can have. The pass currently may change when `UndefVarError` is raised.
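A rough sketch of the kind of use/def bookkeeping described here, reusing the `ValueDef`/`add_def` names that appear in the review snippets above; the other names and fields are guesses for illustration, not the actual implementation:

```julia
# Per-variable record of where a value is defined and used, so a rewrite can
# update just the affected entries instead of rescanning the whole function.
struct ValueDef
    assign        # the statement (e.g. an `Expr(:(=), ...)`) that defines the value
    stmts         # the statement array containing that statement
    stmtidx::Int  # its index in `stmts`
end

struct ValueUse
    stmts         # the statement array containing the use
    stmtidx::Int  # index of the using statement
    expr          # the expression whose argument is the use
    exidx::Int    # which argument of `expr` is the use
end

mutable struct ValueInfo
    defs::Vector{ValueDef}
    uses::Vector{ValueUse}
    ValueInfo() = new(ValueDef[], ValueUse[])
end

# infomap: maps a Slot/SSAValue to its ValueInfo
add_def(infomap, key, def::ValueDef) = push!(get!(ValueInfo, infomap, key).defs, def)
add_use(infomap, key, use::ValueUse) = push!(get!(ValueInfo, infomap, key).uses, use)
```

With a structure like this, replacing a use means editing `use.expr.args[use.exidx]` in place and updating the recorded entry, so the info stays valid without a full rescan.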
The hardest part of running non-local optimization passes (i.e. transformations that do not rely only on one or a few neighboring expressions) is avoiding re-analysing the code. Our current IR, though easy for linear scanning, interpreting, codegen and, to a certain degree, storage, is not very easy to make random updates to. This works around the issue in two ways:

1. Never resize the code array when doing updates. Instead, insert nested arrays that we later splice back in for code additions, and use `nothing` for code deletions. This way, the array indices we cached for other metadata about the code stay valid.
2. On top of that, pre-scan the use-def info for all variables before starting the optimization and run the optimization recursively. Code changes also update this use-def data so that it's always valid for the user. Changes that can affect the use or def of another value re-trigger the optimization so that we can take advantage of new optimization opportunities.

This optimization pass should now handle most of the control-flow-insensitive optimizations. Code patterns that are handled partially by this pass but would benefit greatly from a control-flow-sensitive version include:

1. Splitting slots (based on control flow). This way we can completely eliminate the surprising cost due to variable name conflicts, even when one of the defs/uses is not type stable. (This pass currently handles the case where all the defs/uses are type stable.)
2. Delaying allocations. There are cases where the allocation escapes, but only in some branches. This is especially relevant for error paths, since we currently cannot eliminate some `SubArray` allocations only because we want to keep them around for the bounds error. This is very stupid; we should be able to do the allocation only when we throw the error, leaving the performance-critical non-error path allocation-free.
3. Reordering assignments. It is in general illegal to move an assignment when the slot assigned to is not SSA. However, there are many cases where it is actually legal (i.e. if there's no other use or def in between). This shows up a lot in code like

   ```
   SSA = alloc
   slot = SSA
   ```

   which we currently can't optimize, since the slot can't see that the assignment is actually an allocation and not a generic black box. We should be able to merge this and eliminate the SSA based on control-flow info. For this case, def info that looks through SSA values could also help.
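A toy model of point 1 above (the helper names are made up for illustration; the real pass operates on the `CodeInfo` statement array and has to be more careful about statement semantics and branch targets):

```julia
# "Never resize the code array while optimizing":
#  - deletion writes `nothing` in place, so every cached index stays valid
#  - insertion wraps the new statements in a nested Vector at the insertion
#    point, and everything is spliced back into a flat array at the very end
delete_stmt!(code::Vector{Any}, i::Int) = (code[i] = nothing; code)

function insert_stmts!(code::Vector{Any}, i::Int, new_stmts::Vector{Any})
    # Keep the original statement together with the new ones so the indices
    # of later statements do not shift.
    code[i] = Any[new_stmts..., code[i]]
    return code
end

function flatten_code(code::Vector{Any})
    out = Any[]
    for stmt in code
        if stmt === nothing
            continue                          # deleted statement (toy model only)
        elseif isa(stmt, Vector{Any})
            append!(out, flatten_code(stmt))  # splice nested insertions back in
        else
            push!(out, stmt)
        end
    end
    return out
end
```

Cached `(stmts, index)` pairs in the use/def info remain valid throughout the pass, and a single flattening step at the end produces the final statement list.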
Force-pushed from 7cc0d9f to 4e4983f.
Some of the performance regressions in #24113 are caused by extra allocations, so I'm experimenting with combining it with this PR. The version of linear IR here seems to prohibit …
I believe this is now merged, since @JeffBezanson included it in #24113, which I just merged. @yuyichao please yell and reopen if part of this didn't make it.
Still WIP but this is the first version that should in theory do no worse than the old version and doesn't crash the sysimg compilation.....
This fixes a large number of issues, including:

- `enumerate` allocates for arrays of mutables #16190 (which is NOT a dup of the issue it is linked to)

This is almost as much as one can optimize in a control-flow-insensitive pass. A few optimizations that are not implemented yet are:

- `SubArray` creation with non-inbounds access can also be elided.

This relies on a linear IR, so I wrote a simple linearize pass before running this one. The main reason for this, and the hardest/longest part of it, is keeping track of defs and uses as the IR is being updated, to avoid a full scan of the IR after each change. A verifying function is added to make sure the information is (manually) kept up-to-date (it is disabled even for debug builds since it's really slow: a full scan of the function after each change).
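As a concrete, hypothetical illustration of the `enumerate` item above: with an array of mutable structs, the `(i, p)` tuple yielded by `enumerate` could end up heap-allocated on every iteration even though it is immediately destructured, which is exactly the kind of non-escaping allocation a pass like this can eliminate.

```julia
# Hypothetical micro-example (not a benchmark from the PR): the (i, p) tuple
# produced by enumerate is destructured right away and never escapes, so its
# allocation should be elided.
mutable struct Point
    x::Float64
    y::Float64
end

function sumx(points::Vector{Point})
    s = 0.0
    for (i, p) in enumerate(points)
        s += i * p.x
    end
    return s
end

sumx([Point(rand(), rand()) for _ in 1:10])
```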
There are a few known test failures, but they are mostly related to the linear IR.
I'm not confident that this won't crash, but let's try. Just a warning: in case this crashes Pkg, nanosoldier might need a manual reset.
@nanosoldier `runbenchmarks(ALL, vs=":master")`