
Much more aggressive alloc_elim_pass! #23240

Closed
wants to merge 3 commits into from

Conversation

yuyichao
Contributor

@yuyichao yuyichao commented Aug 13, 2017

Still WIP, but this is the first version that should in theory do no worse than the old version and doesn't crash the sysimg compilation.

This fixes a large number of issues including

This is almost as much as one can optimize in a control-flow-insensitive pass. A few optimizations that are not implemented yet are:

  • Elide allocations when there are multiple assignments (DONE)
  • Lazy allocation of objects on error paths, so that SubArray creation with non-inbounds access can also be elided.

This relies on a linear IR, so I wrote a simple linearize pass that runs before this one. The main reason for this, and the hardest/longest part of it, is to keep track of defs and uses as the IR is being updated, to avoid a full scan of the IR after each change. A verifying function is added to make sure the information is (manually) kept up-to-date (disabled even for debug builds since it's really slow: a full scan of the function after each change).
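The def/use bookkeeping described above can be sketched roughly as follows. This is a minimal illustration only; the names `Use`, `UseDefMap`, `record_use!`, and `replace_uses!` are hypothetical, not the actual inference.jl API:

```julia
# Minimal sketch of maintaining use info while mutating a linear IR.
# All names here are made up for illustration; the real pass tracks
# uses/defs with more detail (statement container, index, etc.).

struct Use
    expr::Vector{Any}   # argument list of the expression containing the use
    idx::Int            # which argument slot the value occupies
end

struct UseDefMap
    uses::Dict{Any,Vector{Use}}   # value -> all places it is used
    defs::Dict{Any,Vector{Use}}   # value -> all places it is assigned
end
UseDefMap() = UseDefMap(Dict{Any,Vector{Use}}(), Dict{Any,Vector{Use}}())

record_use!(m::UseDefMap, v, u::Use) = push!(get!(Vector{Use}, m.uses, v), u)

# Replace every use of `old` with `new`, updating the map in place so
# that no rescan of the whole function is needed afterwards.
function replace_uses!(m::UseDefMap, old, new)
    for u in get(m.uses, old, Use[])
        u.expr[u.idx] = new
        record_use!(m, new, u)
    end
    delete!(m.uses, old)
end
```

The point of the exercise is that every mutation goes through a helper that keeps the map consistent, which is what the verifying function mentioned above would double-check by brute force.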

There are a few known test failures, but they are mostly related to the linear IR.

I'm not confident that this won't crash, but let's try. Just a warning that in case this crashes Pkg, the nanosoldier might need a manual reset.
@nanosoldier runbenchmarks(ALL, vs=":master")

@yuyichao yuyichao force-pushed the yyc/typeinf/alloc branch 4 times, most recently from 6a4b19f to 5887be5 on August 13, 2017 21:12
@nanosoldier
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@yuyichao yuyichao force-pushed the yyc/typeinf/alloc branch 3 times, most recently from 72fabd3 to a0752a8 on August 13, 2017 21:32
@yuyichao
Contributor Author

I'm quite surprised that the first version didn't crash nanosoldier. Anyway, trying again now with a few (correctness and performance) bugs fixed.

@yuyichao
Contributor Author

yuyichao commented Aug 14, 2017

Hmmmmm, I swear I copied the nanosoldier line into my comment above, but somehow it isn't there. Anyway, good that I checked. So, trying again:

@nanosoldier runbenchmarks(ALL, vs=":master")

```julia
elseif head === :call
    fld = expr.args[3]
    if is_known_call(expr, isdefined, ctx.sv.src, ctx.sv.mod)
        use_kind = 0
```
Member

Should have names for these constants.

@JeffBezanson
Member

I assume it would work to enable *very-linear-mode* in the frontend instead of having a separate linearize pass here? If so, we should just do that (ref #23160).

Since this replaces a lot of the code anyway, this seems like a good opportunity to split inference.jl into two files: one for inference and one for other optimization passes.

@nanosoldier
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@yuyichao
Contributor Author

@nanosoldier runbenchmarks(ALL, vs=":master")

@yuyichao yuyichao force-pushed the yyc/typeinf/alloc branch 2 times, most recently from 2bf0ce3 to 3d9ee57 on August 17, 2017 04:31
@nanosoldier
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@yuyichao
Contributor Author

I've randomly picked a regressed benchmark, and it seems to be due to the inference test failure caused by the linear IR.

Should have names for these constants.

Sure... Though in this case these are strictly function-local (and used only on a single variable...)

I assume it would work to enable very-linear-mode in the frontend instead of having a separate linearize pass here? If so, we should just do that (ref #23160).

This does need a particular flavor to work most efficiently (i.e. split out the undef check). Unless progress on making the other parts ready is too slow, I do plan to merge this after we are fully ready to flip that switch.

Since this replaces a lot of the code anyway, this seems like a good opportunity to split inference.jl into two files: one for inference and one for other optimization passes.

At least not now, to reduce conflicts. I'm also not a fan of un-namespaced internal API crossing multiple files.
When there's a clearer separation like #23276 it might be OK.

@yuyichao
Contributor Author

The inference test should be fixed. It's in its own commit since I'm not sure if this is the right fix (it looks safe and it makes sense that it works, but I'm not really sure why it worked previously).

Check performance again.

@nanosoldier runbenchmarks(ALL, vs=":master")

@nanosoldier
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@yuyichao
Contributor Author

@nanosoldier runbenchmarks(ALL, vs=":master")

```julia
lhs = stmt.args[1]
if isa(lhs, SSAValue) || isa(lhs, Slot)
    add_def(ctx.infomap, lhs, ValueDef(stmt, stmts, 1))
e
```
Member

(:symbol,) is also valid

```julia
lhs = stmt.args[1]
if isa(lhs, SSAValue) || isa(lhs, Slot)
    add_def(ctx.infomap, lhs, ValueDef(stmt, stmts, 1))
e
```
Member

@vtjnash vtjnash Sep 20, 2017

this will be inferred, and isn't necessary

```julia
lhs = stmt.args[1]
if isa(lhs, SSAValue) || isa(lhs, Slot)
    add_def(ctx.infomap, lhs, ValueDef(stmt, stmts, 1))
e
```
Member

3:5 are metadata and shouldn't need to be linearized

@vtjnash
Member

vtjnash commented Sep 20, 2017

@nanosoldier runbenchmarks(ALL, vs=":master")

@nanosoldier
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@yuyichao yuyichao force-pushed the yyc/typeinf/alloc branch 2 times, most recently from b2a095a to 9fdc6f6 on September 24, 2017 03:24
@yuyichao
Contributor Author

The performance issue is caused by Expr(:boundscheck) being an expression and therefore getting linearized, affecting DCE. Added a special case in the linearization pass that works around it for now.

Also did some cleanup of the commits and wrote proper commit messages. Probably still needs cleanup.

P.S. GitHub seems to have messed up and doesn't want to show me which lines of code the three comments from @vtjnash are about. They all link to the same code and none of them applies there AFAICT. I can only guess the ccall one applies to the ccall linearization case.

Let's check again....

@nanosoldier runbenchmarks(ALL, vs=":master")

@nanosoldier
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@yuyichao yuyichao changed the title WIP: Much more aggressive alloc_elim_pass! Much more aggressive alloc_elim_pass! Sep 28, 2017
@yuyichao
Contributor Author

Updated to also handle @gc_preserve values just like ccall roots, so this PR should be feature complete as far as I'm concerned (complete for this PR, not counting related optimizations that we want/need).

@nanosoldier runbenchmarks(ALL, vs=":master")

@nanosoldier
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@yuyichao
Contributor Author

The CI failures are likely mostly caused by an increase in memory consumption, which in turn is almost entirely caused by the linear IR. Plotting the max RSS during sysimg compilation (maxrss plot), the increase in memory consumption comes almost entirely from the second commit (in actual commit order, not whatever order GitHub shows), which is the linearization commit. A likely cause of this is the large number of metadata nodes that were previously (incorrectly) removed. This can be seen for a few very simple functions that are used everywhere, like

```julia
julia> @code_warntype 1 + 1.0
Variables:
  x::Int64
  y::Float64
  px<optimized out>
  py<optimized out>

Body:
  begin
      # meta: location promotion.jl promote 247
      SSAValue(3) = (Base.sitofp)(Float64, x::Int64)::Float64
      # meta: pop location
      return (Base.add_float)(SSAValue(3), y::Float64)::Float64
  end::Float64
```

vs

```julia
julia> @code_warntype 1 + 0.1
Variables:
  x::Int64
  y::Float64
  px<optimized out>
  py<optimized out>

Body:
  begin
      # meta: location promotion.jl promote 247
      # meta: location promotion.jl _promote 203
      # meta: location float.jl convert 57
      SSAValue(3) = (Base.sitofp)(Float64, x::Int64)::Float64
      # meta: pop location
      # meta: pop location
      # meta: pop location
      # meta: location float.jl + 391
      SSAValue(15) = (Base.add_float)(SSAValue(3), y::Float64)::Float64
      # meta: pop location
      return SSAValue(15)
  end::Float64
```

A constant pool (like LLVM's) might help (if this is actually the main reason), though that should ideally be done in another PR (and even better, in the frontend).

These objects make it really hard to mutate the AST correctly,
since a mutation can accidentally be done at places where it is invalid.
In particular, the feature we'd like to take advantage of is mostly the limited variety of uses.
This means we can easily:

1. Keep track of the expressions in which a value is used.
2. Replace the uses.
3. Update the use info without extensive rescanning.

This also limits the kinds of side effects an expression's arguments can have.

The pass currently may change when `UndefVarError` is raised.
The hardest part of running non-local optimization passes
(i.e. transformations that do not rely only on one or a few neighboring expressions)
is to avoid re-analyzing the code. Our current IR, though easy for linear scanning,
interpreting, codegen and, to a certain degree, storage, is not very easy to make random
updates to. We try to work around this issue in two ways:

1. Never resize the code array when doing updates.
   Instead, insert nested arrays that we'll later splice back in for code addition, and
   use `nothing` for code deletion. This way, the array indices we cached for other metadata
   about the code stay valid.
2. Building on the previous approach, pre-scan the use-def info for all variables before
   starting the optimization and run the optimization recursively.
   Code changes also update this use-def data so that it's always valid for the user.
   Changes that can affect the use or def of another value re-trigger the optimization
   so that we can take advantage of new optimization opportunities.
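The first workaround can be illustrated with a toy model of the deletion/insertion scheme. The helper names below are hypothetical; the real pass operates on the typed AST:

```julia
# Toy model of in-place IR editing without resizing the code array:
# deletions become `nothing`, insertions become nested arrays that a
# final pass splices back in, so cached statement indices stay valid.

delete_stmt!(code::Vector{Any}, i::Int) = (code[i] = nothing; code)

# Replace statement i with several statements, stored as a nested array.
expand_stmt!(code::Vector{Any}, i::Int, stmts::Vector{Any}) = (code[i] = stmts; code)

# Final flattening pass: splice nested arrays, drop `nothing`s.
function flatten_code(code::Vector{Any})
    out = Any[]
    for stmt in code
        if stmt === nothing
            # deleted statement: skip
        elseif isa(stmt, Vector{Any})
            append!(out, stmt)   # spliced-in replacement statements
        else
            push!(out, stmt)
        end
    end
    return out
end
```

Because `code` never changes length until the final flattening, any index stored in the use-def info remains valid across arbitrarily many edits.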

This optimization pass should now handle most control-flow-insensitive optimizations.
Code patterns that are handled partially by this pass but would benefit greatly from a
control-flow-sensitive version include:

1. Splitting slots (based on control flow)

   This way we can completely eliminate the surprising cost of variable name conflicts,
   even when one of the defs/uses is not type stable.
   (This pass currently handles the case where all the defs/uses are type stable.)

2. Delaying allocations

   There are cases where the allocation escapes, but only in some branches.
   This is especially true for error paths: we currently cannot eliminate some `SubArray`
   allocations only because we need to keep them alive for the bounds error. This is very
   silly; we should be able to do the allocation only when we throw the error, leaving the
   performance-critical non-error path allocation-free.

3. Reordering assignments

   It is in general illegal to move an assignment when the slot assigned to is not SSA.
   However, there are many cases where it is actually legal to do so
   (i.e. when there is no other use or def in between). This shows up a lot in code like

   ```
   SSA = alloc
   slot = SSA
   ```

   which we currently can't optimize, since at the slot assignment we can't see that the
   value is actually an allocation and not a generic black box. We should be able to merge
   the two and eliminate the SSA value based on control-flow info. For this case, def info
   that looks through SSA values can also help.
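The legality condition in point 3 (no other use or def of the slot between the two statements) can be sketched as follows. This is purely illustrative; `can_merge` and the simplified tuple statement representation `(:assign, lhs, rhs)` are made up for this example:

```julia
# Sketch of the legality check for merging `SSA = alloc; slot = SSA`:
# moving the assignment is only legal when no other use or def of the
# slot occurs between the SSA definition and the slot assignment.
# Statements are simplified to (:assign, lhs, rhs) tuples.

function can_merge(code, ssa_idx::Int, slot_idx::Int, slot)
    for i in (ssa_idx + 1):(slot_idx - 1)
        stmt = code[i]
        stmt === nothing && continue          # deleted statement
        (_, lhs, rhs) = stmt
        # any intervening def (lhs) or use (rhs) of `slot` blocks the merge
        (lhs === slot || rhs === slot) && return false
    end
    return true
end
```

A real implementation would consult the pre-computed use-def info instead of rescanning the statements, but the condition being checked is the same.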
@JeffBezanson
Member

Some of the performance regressions in #24113 are caused by extra allocations, so I'm experimenting with combining it with this PR.

The version of linear IR here seems to prohibit `call` and `new` as the argument to `return`, which I find contributes significantly to IR size (due to lots of small functions, I guess). Any chance we could allow it? Doesn't have to be done right now, of course.

@Keno
Member

Keno commented Dec 11, 2017

I believe this is now merged, since @JeffBezanson included it in #24113, which I just merged. @yuyichao please yell and reopen if part of this didn't make it.

@Keno Keno closed this Dec 11, 2017
@ararslan ararslan deleted the yyc/typeinf/alloc branch December 11, 2017 20:15