
Much more aggressive alloc_elim_pass! #23240

Closed
wants to merge 3 commits into from

Conversation

yuyichao
Contributor

@yuyichao yuyichao commented Aug 13, 2017

Still WIP, but this is the first version that should in theory do no worse than the old version and doesn't crash the sysimg compilation.

This fixes a large number of issues including

This is almost as much as one can optimize in a control-flow-insensitive pass. A few optimizations that are not implemented yet are:

  • Elide allocations when there are multiple assignments (DONE)
  • Lazy allocation of objects on error paths, so that SubArray creation with non-inbounds access can also be elided.

This relies on a linear IR, so I wrote a simple linearize pass that runs before this one. The main reason for this, and the hardest/longest part of it, is to keep track of defs and uses as the IR is being updated, to avoid a full scan of the IR after each change. A verifying function is added to make sure the information is (manually) kept up-to-date (disabled even for debug builds since it's really slow: a full scan of the function after each change).
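The def/use bookkeeping described above can be sketched roughly as follows. This is a minimal illustration only; the names `Use`, `UseDefMap`, `record_use!`, and `replace_uses!` are hypothetical, not the actual inference.jl API:

```julia
# Minimal sketch of maintaining use info while mutating a linear IR.
# All names here are made up for illustration; the real pass tracks
# uses/defs with more detail (statement container, index, etc.).

struct Use
    expr::Vector{Any}   # argument list of the expression containing the use
    idx::Int            # which argument slot the value occupies
end

struct UseDefMap
    uses::Dict{Any,Vector{Use}}   # value -> all places it is used
    defs::Dict{Any,Vector{Use}}   # value -> all places it is assigned
end
UseDefMap() = UseDefMap(Dict{Any,Vector{Use}}(), Dict{Any,Vector{Use}}())

record_use!(m::UseDefMap, v, u::Use) = push!(get!(Vector{Use}, m.uses, v), u)

# Replace every use of `old` with `new`, updating the map in place so
# that no rescan of the whole function is needed afterwards.
function replace_uses!(m::UseDefMap, old, new)
    for u in get(m.uses, old, Use[])
        u.expr[u.idx] = new
        record_use!(m, new, u)
    end
    delete!(m.uses, old)
end
```

The point of the exercise is that every mutation goes through a helper that keeps the map consistent, which is what the verifying function mentioned above would double-check by brute force.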

There are a few known test failures, but they are mostly related to the linear IR.

I'm not confident that this won't crash, but let's try. Just a warning that in case this crashes Pkg, the nanosoldier might need a manual reset.
@nanosoldier runbenchmarks(ALL, vs=":master")

@yuyichao yuyichao force-pushed the yyc/typeinf/alloc branch 4 times, most recently from 6a4b19f to 5887be5 on August 13, 2017 21:12
@nanosoldier
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@yuyichao yuyichao force-pushed the yyc/typeinf/alloc branch 3 times, most recently from 72fabd3 to a0752a8 on August 13, 2017 21:32
@yuyichao
Contributor Author

I'm quite surprised that the first version didn't crash nanosoldier. Anyway, trying again now with a few (correctness and performance) bugs fixed.

@yuyichao
Contributor Author

yuyichao commented Aug 14, 2017

Hmmmmm, I swear I copied the nanosoldier line into my comment above, but somehow it isn't there. Anyway, good that I checked. So, trying again:

@nanosoldier runbenchmarks(ALL, vs=":master")

```julia
elseif head === :call
    fld = expr.args[3]
    if is_known_call(expr, isdefined, ctx.sv.src, ctx.sv.mod)
        use_kind = 0
```
Member

Should have names for these constants.

@JeffBezanson
Member

I assume it would work to enable *very-linear-mode* in the frontend instead of having a separate linearize pass here? If so, we should just do that (ref #23160).

Since this replaces a lot of the code anyway, this seems like a good opportunity to split inference.jl into two files: one for inference and one for other optimization passes.

@nanosoldier
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@yuyichao
Contributor Author

@nanosoldier runbenchmarks(ALL, vs=":master")

@yuyichao yuyichao force-pushed the yyc/typeinf/alloc branch 2 times, most recently from 2bf0ce3 to 3d9ee57 on August 17, 2017 04:31
@nanosoldier
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@yuyichao
Contributor Author

I've randomly picked a regressed benchmark, and it seems to be due to the inference test failure caused by the linear IR.

Should have names for these constants.

Sure... Though in this case these are strictly function-local (and used only on a single variable...)

I assume it would work to enable very-linear-mode in the frontend instead of having a separate linearize pass here? If so, we should just do that (ref #23160).

This does need a particular flavor to work most efficiently (i.e. split out the undef check). Unless progress on making the other parts ready is too slow, I do plan to merge this after we are fully ready to flip that switch.

Since this replaces a lot of the code anyway, this seems like a good opportunity to split inference.jl into two files: one for inference and one for other optimization passes.

At least not now, to reduce conflicts. I'm also not a fan of un-namespaced internal API crossing multiple files.
When there's a clearer separation like #23276 it might be OK.

@yuyichao
Contributor Author

The inference test should be fixed. It's in its own commit since I'm not sure if this is the right fix (it looks safe and it makes sense that it works, but I'm not really sure why it worked previously).

Check performance again.

@nanosoldier runbenchmarks(ALL, vs=":master")

@nanosoldier
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@yuyichao
Contributor Author

@nanosoldier runbenchmarks(ALL, vs=":master")

```julia
lhs = stmt.args[1]
if isa(lhs, SSAValue) || isa(lhs, Slot)
    add_def(ctx.infomap, lhs, ValueDef(stmt, stmts, 1))
e
```
Member

(:symbol,) is also valid

```julia
lhs = stmt.args[1]
if isa(lhs, SSAValue) || isa(lhs, Slot)
    add_def(ctx.infomap, lhs, ValueDef(stmt, stmts, 1))
e
```
Member

@vtjnash vtjnash Sep 20, 2017

this will be inferred, and isn't necessary

```julia
lhs = stmt.args[1]
if isa(lhs, SSAValue) || isa(lhs, Slot)
    add_def(ctx.infomap, lhs, ValueDef(stmt, stmts, 1))
e
```
Member

3:5 are metadata and shouldn't need to be linearized

@vtjnash
Member

vtjnash commented Sep 20, 2017

@nanosoldier runbenchmarks(ALL, vs=":master")

@nanosoldier
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@yuyichao yuyichao force-pushed the yyc/typeinf/alloc branch 2 times, most recently from b2a095a to 9fdc6f6 on September 24, 2017 03:24
@yuyichao
Contributor Author

The performance issue is caused by Expr(:boundscheck) being an expression and therefore getting linearized, affecting DCE. Added a special case in the linearization pass that works around it for now.

Also did some cleanup of the commits and wrote proper commit messages. Probably still needs cleanup.

P.S. GitHub seems to have messed up and doesn't want to show me which lines of code the three comments from @vtjnash are about. They all link to the same code and none of them applies there AFAICT. I can only guess the ccall one applies to the ccall linearization case.

Let's check again....

@nanosoldier runbenchmarks(ALL, vs=":master")

@nanosoldier
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@yuyichao yuyichao changed the title WIP: Much more aggressive alloc_elim_pass! Much more aggressive alloc_elim_pass! Sep 28, 2017
@yuyichao
Contributor Author

Updated to also handle @gc_preserve values just like ccall roots, so this PR should be feature complete as far as I'm concerned (complete for this PR, not counting related optimizations that we want/need).

@nanosoldier runbenchmarks(ALL, vs=":master")

@nanosoldier
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@yuyichao
Contributor Author

The CI failures are likely mostly caused by an increase in memory consumption, which in turn is almost entirely caused by the linear IR. Plotting the max RSS during sysimg compilation (maxrss plot), the increase in memory consumption comes almost entirely from the second commit (in actual commit order, not whatever order GitHub shows), which is the linearization commit. A likely cause of this is the large number of metadata nodes that were previously (incorrectly) removed. This can be seen for a few very simple functions that are used everywhere, like

```julia
julia> @code_warntype 1 + 1.0
Variables:
  x::Int64
  y::Float64
  px<optimized out>
  py<optimized out>

Body:
  begin
      # meta: location promotion.jl promote 247
      SSAValue(3) = (Base.sitofp)(Float64, x::Int64)::Float64
      # meta: pop location
      return (Base.add_float)(SSAValue(3), y::Float64)::Float64
  end::Float64
```

vs

```julia
julia> @code_warntype 1 + 0.1
Variables:
  x::Int64
  y::Float64
  px<optimized out>
  py<optimized out>

Body:
  begin
      # meta: location promotion.jl promote 247
      # meta: location promotion.jl _promote 203
      # meta: location float.jl convert 57
      SSAValue(3) = (Base.sitofp)(Float64, x::Int64)::Float64
      # meta: pop location
      # meta: pop location
      # meta: pop location
      # meta: location float.jl + 391
      SSAValue(15) = (Base.add_float)(SSAValue(3), y::Float64)::Float64
      # meta: pop location
      return SSAValue(15)
  end::Float64
```

A constant pool (like LLVM's) might help (if this is actually the main reason), though that should ideally be done in another PR (and even better, in the frontend).

These objects make it really hard to mutate the AST correctly,
since a mutation can accidentally be done at places where it is invalid.
In particular, the feature we'd like to take advantage of is mostly the limited variety of uses.
This means we can easily:

1. Keep track of the expressions in which a value is used.
2. Replace the uses.
3. Update the use info without extensive rescanning.

This also limits the kinds of side effects an expression's arguments can have.

The pass currently may change when `UndefVarError` is raised.
The hardest part of running non-local optimization passes
(i.e. transformations that do not rely only on one or a few neighboring expressions)
is to avoid re-analyzing the code. Our current IR, though easy for linear scanning,
interpreting, codegen and, to a certain degree, storage, is not very easy to make random
updates to. We try to work around this issue in two ways:

1. Never resize the code array when doing updates.
   Instead, insert nested arrays that we'll later splice back in for code addition, and
   use `nothing` for code deletion. This way, the array indices we cached for other metadata
   about the code stay valid.
2. Building on the previous approach, pre-scan the use-def info for all variables before
   starting the optimization and run the optimization recursively.
   Code changes also update this use-def data so that it's always valid for the user.
   Changes that can affect the use or def of another value re-trigger the optimization
   so that we can take advantage of new optimization opportunities.
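The first workaround can be illustrated with a toy model of the deletion/insertion scheme. The helper names below are hypothetical; the real pass operates on the typed AST:

```julia
# Toy model of in-place IR editing without resizing the code array:
# deletions become `nothing`, insertions become nested arrays that a
# final pass splices back in, so cached statement indices stay valid.

delete_stmt!(code::Vector{Any}, i::Int) = (code[i] = nothing; code)

# Replace statement i with several statements, stored as a nested array.
expand_stmt!(code::Vector{Any}, i::Int, stmts::Vector{Any}) = (code[i] = stmts; code)

# Final flattening pass: splice nested arrays, drop `nothing`s.
function flatten_code(code::Vector{Any})
    out = Any[]
    for stmt in code
        if stmt === nothing
            # deleted statement: skip
        elseif isa(stmt, Vector{Any})
            append!(out, stmt)   # spliced-in replacement statements
        else
            push!(out, stmt)
        end
    end
    return out
end
```

Because `code` never changes length until the final flattening, any index stored in the use-def info remains valid across arbitrarily many edits.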

This optimization pass should now handle most control-flow-insensitive optimizations.
Code patterns that are handled partially by this pass but would benefit greatly from a
control-flow-sensitive version include:

1. Splitting slots (based on control flow)

   This way we can completely eliminate the surprising cost of variable name conflicts,
   even when one of the defs/uses is not type stable.
   (This pass currently handles the case where all the defs/uses are type stable.)

2. Delaying allocations

   There are cases where the allocation escapes, but only in some branches.
   This is especially true for error paths: we currently cannot eliminate some `SubArray`
   allocations only because we need to keep them alive for the bounds error. This is very
   silly; we should be able to do the allocation only when we throw the error, leaving the
   performance-critical non-error path allocation-free.

3. Reordering assignments

   It is in general illegal to move an assignment when the slot assigned to is not SSA.
   However, there are many cases where it is actually legal to do so
   (i.e. when there is no other use or def in between). This shows up a lot in code like

   ```
   SSA = alloc
   slot = SSA
   ```

   which we currently can't optimize, since at the slot assignment we can't see that the
   value is actually an allocation and not a generic black box. We should be able to merge
   the two and eliminate the SSA value based on control-flow info. For this case, def info
   that looks through SSA values can also help.
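The legality condition in point 3 (no other use or def of the slot between the two statements) can be sketched as follows. This is purely illustrative; `can_merge` and the simplified tuple statement representation `(:assign, lhs, rhs)` are made up for this example:

```julia
# Sketch of the legality check for merging `SSA = alloc; slot = SSA`:
# moving the assignment is only legal when no other use or def of the
# slot occurs between the SSA definition and the slot assignment.
# Statements are simplified to (:assign, lhs, rhs) tuples.

function can_merge(code, ssa_idx::Int, slot_idx::Int, slot)
    for i in (ssa_idx + 1):(slot_idx - 1)
        stmt = code[i]
        stmt === nothing && continue          # deleted statement
        (_, lhs, rhs) = stmt
        # any intervening def (lhs) or use (rhs) of `slot` blocks the merge
        (lhs === slot || rhs === slot) && return false
    end
    return true
end
```

A real implementation would consult the pre-computed use-def info instead of rescanning the statements, but the condition being checked is the same.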
@JeffBezanson
Member

Some of the performance regressions in #24113 are caused by extra allocations, so I'm experimenting with combining it with this PR.

The version of linear IR here seems to prohibit `call` and `new` as the argument to `return`, which I find contributes significantly to IR size (due to lots of small functions, I guess). Any chance we could allow it? Doesn't have to be done right now, of course.

@Keno
Member

Keno commented Dec 11, 2017

I believe this is now merged, since @JeffBezanson included it in #24113, which I just merged. @yuyichao please yell and reopen if part of this didn't make it.

@Keno Keno closed this Dec 11, 2017
@ararslan ararslan deleted the yyc/typeinf/alloc branch December 11, 2017 20:15