The GC root placement pass 1.0 deserves #21888

Keno · 2017-05-15T14:57:08Z

Item 1/2 on my 1.0 must have list:

This is a principled rewrite of our GC root placement pass to occur late in the pass pipeline rather than early as well as produce a more optimal placement of GC roots. Additionally, it will allow us to perform much more general transformations at the LLVM stage, because we no longer have to worry about potentially introducing unrooted values, etc. It also allows the LLVM optimizer itself to do significantly more aggressive optimizations in the presence of boxed objects.

Currently this is written against LLVM 5.0 and uses features that are new to that release, as well as the following LLVM patches:
https://reviews.llvm.org/D32593
https://reviews.llvm.org/D33110
https://reviews.llvm.org/D33129
https://reviews.llvm.org/D33179

I do think, however, that I can backport this to 4.0, albeit inserted closer to the start of the pipeline. That wouldn't give quite as much benefit, but it should be equivalent to or slightly better than what we have right now.

I have focused on correctness over compile time performance in the implementation here and I know the algorithm is not optimal (it's fair to say it's pretty naive). Additionally, there is a significant number of follow on optimizations, as well as codegen cleanup that is yet to be done. However, this is already a pretty big change, so I'm saving those for after this is merged.

Design notes are in the devdocs, implementation notes as code comments in the placement pass. There are also tests for all the tricky corner cases I found.

Fixes #20981
Fixes #15369
Supersedes #21158

timholy · 2017-05-15T15:16:33Z

Sounds really nice. Does this have any implications for temporary buffer reuse? (For example, if the root can be placed in the caller...) I'm thinking of functions along the lines of

function mymedian(v)
    vc = copy(v)
    sort!(vc)  # just for illustration, this is not optimal...
    inds = linearindices(vc)
    vc[(first(inds)+last(inds)) ÷ 2]
end

Calling such a function repeatedly with many small vectors currently exacts a non-trivial GC cost, which can be mitigated by pre-allocating a temporary buffer and passing it in as an argument.

vtjnash · 2017-05-15T15:15:47Z

src/codegen.cpp

@@ -2362,13 +2370,15 @@ static void simple_escape_analysis(jl_value_t *expr, bool esc, jl_codectx_t *ctx
 // Emit a gc-root slot indicator
 static Value *emit_local_root(jl_codectx_t *ctx, jl_varinfo_t *vi)
 {
-    CallInst *newroot = CallInst::Create(prepare_call(gcroot_func), "", /*InsertBefore*/ctx->ptlsStates);
+    Instruction *newroot = new AllocaInst(T_prjlvalue, 0, "gcroot", /*InsertBefore*/ctx->ptlsStates);


I think for performance we will want to do:

#ifndef NDEBUG
newroot->setName("gcroot");
#endif

vtjnash · 2017-05-15T15:17:57Z

src/codegen.cpp

    int slot = 0;
    assert(args.size() > 0);
+    auto lifetime_start = Intrinsic::getDeclaration(jl_Module, Intrinsic::lifetime_start, {T_pprjlvalue});
+    builder.CreateCall(lifetime_start, {


LLVM37+ syntax here, but elsewhere you seem to be maintaining LLVM33 support. Seems like the PR should just be consistent about dropping support?

vtjnash · 2017-05-15T15:21:22Z

src/codegen.cpp

    }

    JL_FEAT_REQUIRE(ctx, runtime);
-    Value *varg1 = boxed(arg1, ctx);
-    Value *varg2 = boxed(arg2, ctx, false); // potentially unrooted!
+    Value *varg1 = mark_callee_rooted(boxed(arg1, ctx));


this isn't callee-rooted (and the creation of varg2 could allocate)

That's fine, the pass will still keep it live at any safepoint that it's live across. It just won't put it in a gc root if a safepoint terminates the live interval.
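
The distinction drawn in this reply can be illustrated with a toy model. This is a minimal sketch, not the pass's actual code: `LiveInterval`, `needsRootAt`, and `countRootedValues` are hypothetical names, and the model assumes straight-line code where a callee-rooted value has one definition index and one last-use index.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy model: a value is live from the instruction that defines
// it (def) up to and including its last use (lastUse). Indices are positions
// in a straight-line instruction list; real IR has control flow.
struct LiveInterval {
    std::size_t def;
    std::size_t lastUse;
};

// A value needs a GC root at a safepoint only if it is live ACROSS it, i.e.
// defined before the safepoint and still used after it. If the safepoint is
// the last use (it terminates the live interval), no root is needed.
bool needsRootAt(const LiveInterval &v, std::size_t safepoint)
{
    assert(v.def <= v.lastUse);
    return v.def < safepoint && safepoint < v.lastUse;
}

// Count how many values need a root at one or more of the given safepoints.
std::size_t countRootedValues(const std::vector<LiveInterval> &values,
                              const std::vector<std::size_t> &safepoints)
{
    std::size_t rooted = 0;
    for (const LiveInterval &v : values) {
        for (std::size_t sp : safepoints) {
            if (needsRootAt(v, sp)) {
                ++rooted;
                break; // one slot is enough for this value
            }
        }
    }
    return rooted;
}
```

In this model a value whose last use is the safepoint call itself is never rooted, while a value that is used again after the safepoint is.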

vtjnash · 2017-05-15T15:23:13Z

src/codegen.cpp

@@ -2693,12 +2720,13 @@ static bool emit_builtin_call(jl_cgval_t *ret, jl_value_t *f, jl_value_t **args,
        }
        if (jl_subtype(ty, (jl_value_t*)jl_type_type)) {
            *ret = emit_expr(args[1], ctx);
-            Value *rt_ty = boxed(emit_expr(args[2], ctx), ctx);
+            Value *rt_ty = mark_callee_rooted(boxed(emit_expr(args[2], ctx), ctx));


This isn't callee-rooted unless boxed tells you that it was

Same comment as above. The placement pass figures it out. The callee rooted annotation is just to annotate that the only thing you're allowed to do with it is pass it to a function that callee roots.

OK, but type-assert also doesn't root this, so it does need to be rooted at the call safepoint by the caller / codegen

vtjnash · 2017-05-15T15:24:32Z

src/codegen.cpp

-                        retval.V),
-                    retval.gcroot);
+                builder.CreateStore(box,
+                                      retval.gcroot);


alignment here seems odd. seems like this could fit on the one line

vtjnash · 2017-05-15T15:32:40Z

src/jitlayers.cpp

@@ -129,12 +130,16 @@ void addOptimizationPasses(PassManager *PM)
    if (jl_options.opt_level == 0) {
        PM->add(createCFGSimplificationPass()); // Clean up disgusting code
        PM->add(createMemCpyOptPass()); // Remove memcpy / form memset
-        PM->add(createLowerPTLSPass(imaging_mode));
 #if JL_LLVM_VERSION >= 40000
        PM->add(createAlwaysInlinerLegacyPass()); // Respect always_inline
 #else
        PM->add(createAlwaysInlinerPass()); // Respect always_inline
 #endif


are we actually confident that this can't break your gc-invariants and end up trying to run that pass on external code that it wasn't supposed to try to run on?

Yes. It only looks at our address spaces. If external code uses our address spaces, it needs to follow our invariants.

vtjnash · 2017-05-15T15:36:08Z

src/jitlayers.cpp

@@ -489,8 +501,13 @@ JuliaOJIT::JuliaOJIT(TargetMachine &TM)
        addOptimizationPasses(&PM);
    }
    else {


could we write this as addOptimizationPasses(&PM, /* -O */0), and make the jl_options.opt_level option to that function explicit?
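
A rough sketch of what this suggestion amounts to, assuming nothing about the real pass manager: `ToyPassManager` and the pass-name strings below are hypothetical stand-ins, and only the shape of the `addOptimizationPasses(PM, opt_level)` signature comes from the comment above.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a pass manager; it only records pass names.
struct ToyPassManager {
    std::vector<std::string> passes;
    void add(const std::string &name) { passes.push_back(name); }
};

// The optimization level is an explicit parameter instead of being read from
// a global such as jl_options.opt_level inside the function.
void addOptimizationPasses(ToyPassManager *PM, int opt_level)
{
    assert(PM != nullptr);
    if (opt_level == 0) {
        PM->add("cfg-simplification");
        PM->add("memcpy-opt");
        PM->add("always-inliner");
        return;
    }
    PM->add("full-pipeline"); // placeholder for the complete optimizing pipeline
}
```

With the level passed in, a call site such as `addOptimizationPasses(&PM, /* -O */0)` states which pipeline it wants without depending on global option state.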

vtjnash · 2017-05-15T15:39:41Z

doc/src/devdocs/llvm.md

+```
+The GC root placement pass will treat the jl_roots operand bundle as if it were
+a regular operand. However, as a final step, after the gc roots are inserted,
+it will drop the operand bundle to avoid confusing codegen.


"avoid confusing machine instruction selection"?

Keno · 2017-05-15T15:49:28Z

Sounds really nice. Does this have any implications for temporary buffer reuse?

Yes and no. It makes it a lot easier to do that kind of transformation at the LLVM level, as well as making LLVM-level inlining easier. Those two combined could easily achieve what you want in your example. However, by itself this doesn't really do much there.

StefanKarpinski · 2017-05-16T20:36:16Z

doc/src/devdocs/llvm.md

+
+Minimize the number of needed gc roots/stores to them subject to the constraint
+that at every safepoint, any live gc-tracked pointer (i.e. is a path after this
+point that contains a use of this pointer).


Incomplete sentence?

Yes, it should end with "is in a gc slot" ;).

StefanKarpinski · 2017-05-16T21:10:01Z

doc/src/devdocs/llvm.md

+for the function. As a result, the external rooting must be arranged while the
+value is still tracked by the system. I.e. it is not valid to attempt use the
+result of this operation to establish a global root - the optimizer may have
+already dropped the value.


Would it make any sense for pointer_from_objref to return a pointer in a GC-aware address space? Would that allow LLVM to automatically make code using such pointers safe(r)?

No, the whole point of that function is to escape the GC.

This escape semantic would also be required so that we can stack allocate RefValue{bitstype} in ccall.

StefanKarpinski · 2017-05-16T21:17:06Z

src/llvm-late-gc-lowering.cpp

+
+   Minimize the number of needed gc roots/stores to them subject to the constraint
+   that at every safepoint, any live gc-tracked pointer (i.e. is a path after this
+   point that contains a use of this pointer).


Same incomplete sentence.

StefanKarpinski · 2017-05-16T21:37:28Z

src/llvm-late-gc-lowering.cpp

+      This step performs necessary cleanup before passing the IR to codegen. In
+      particular, it removes any calls to julia_from_objref intrinsics and
+      removes the extra operand bundles from ccalls. In the future it could
+      also strip the addressspace information from all values as this


I don't think "address space" is one word

Well the llvm annotation is addrspace, so I should probably just write that.

StefanKarpinski · 2017-05-17T10:54:53Z

Great documentation and comments, btw. This is the standard we should all strive for :)

Keno · 2017-05-19T23:25:10Z

Rebased and backported to 3.9.1. On 3.9.1, it needs to run very early in the pipeline, so the improvements are not nearly as large. So I guess just think of that as an incentive to move to 5.0 post haste. I don't intend to support any prior LLVM version.

StefanKarpinski · 2017-05-20T16:41:16Z

We should definitely move master to LLVM 5 ASAP since we know that causes all sorts of havoc and needs a long time to settle into stability.

timholy · 2017-05-20T17:32:34Z

It's also a good thing to do this before we introduce too many other incompatibilities: the package ecosystem is a good test bed for shaking out LLVM bugs, but most people won't update packages to 0.7 until 0.7 is looking imminent. Consequently, the longer we wait, the more packages will be untestable for reasons unrelated to the LLVM update.

Keno · 2017-05-22T17:41:26Z

Fixed the 32bit build. The windows build seems to be failing because the LLVM patches I added here aren't being applied. @tkelman how do we rebuild the windows CI LLVM binaries?

tkelman · 2017-05-22T20:54:55Z

I have to build them by hand, though now that we build llvm as a shared library we could maybe get them from the nightlies (with some fiddling needed for headers?)

Keno · 2017-05-22T21:18:29Z

Not sure what's up with the OS X build. Passes locally. I suspect it'll also need an LLVM version bump.

tkelman · 2017-05-22T21:19:53Z

OS X travis gets its llvm binary from the juliadeps homebrew tap

Keno · 2017-05-25T19:34:02Z

@tkelman @staticfloat the patches here have been merged on master. Can we build new copies of LLVM for Win/OSX from the patchset that's currently on master?

staticfloat · 2017-05-25T19:39:06Z

@Keno can you open a PR to include the patches in this file?

Keno · 2017-05-25T21:28:47Z

@staticfloat let me know when the bottles are built so I can kick off CI again here.

staticfloat · 2017-05-25T21:35:07Z

Hmmm. Compilation failed:

/tmp/llvm39-julia-20170525-79976-1eabd2f/llvm-3.9.1.src/tools/lld/include/lld/Core/Parallel.h:125:14: error: no member named 'thread' in namespace 'std'
        std::thread([=] {
        ~~~~~^
/tmp/llvm39-julia-20170525-79976-1eabd2f/llvm-3.9.1.src/tools/lld/include/lld/Core/Parallel.h:123:10: error: no member named 'thread' in namespace 'std'
    std::thread([&, threadCount] {

Is that coming from one of our patches? This is being built with Clang: 7.0 build 700.

Keno · 2017-05-25T21:38:00Z

llvm-3.9.0_threads perhaps?

staticfloat · 2017-05-25T21:44:40Z

Ah, I see what's going on. We have further patches we need to include to get rid of the std::thread stuff within lld because the llvm3.9 formula installs it by default.

Is there any reason I should get lld compiling properly, or should I just remove it from the subprojects that get built?

vtjnash · 2017-06-14T21:25:05Z

That's starting to look much better. Still a couple left to fix, though. I know this is a pain to rebase, but please don't merge too quickly. There are many new commits (and LLVM passes) on here since it was last reviewed.

vtjnash · 2017-06-15T03:14:59Z

src/llvm-propagate-addrspaces.cpp

+        CI->setMetadata(MD.first, MD.second);
+    CI->setDebugLoc(MI.getDebugLoc());
+#endif
+ToInsert.push_back(std::make_pair(CI, &MI));


indentation

vtjnash · 2017-06-15T03:17:27Z

src/llvm-propagate-addrspaces.cpp

+        for (const auto &MD : TheMDs)
+            CI->setMetadata(MD.first, MD.second);
+        CI->setDebugLoc(MTI.getDebugLoc());
+    #endif


indentation

vtjnash · 2017-06-15T03:45:36Z

src/llvm-propagate-addrspaces.cpp

+    Value *TheFn = Intrinsic::getDeclaration(MTI.getModule(), MTI.getIntrinsicID(),
+        {Dest->getType(), Src->getType(),
+         MTI.getOperand(2)->getType()});
+    CallInst *CI = CallInst::Create(TheFn, {Dest, Src,


I think it'll be better to just do MTI->setCalledFunction(TheFn); setArgOperand(0, Dest); setArgOperand(1, Src);. That way, there's fewer things that need to be moved over to get wrong / outdated.

http://llvm.org/docs/doxygen/html/classllvm_1_1CallInst.html#a61f747edd1ca001427e02e02a320b709

vtjnash · 2017-06-15T05:55:34Z

src/llvm-propagate-addrspaces.cpp

+        for (;it != GEP->op_end(); ++it) {
+            Operands.push_back(*it);
+        }
+        NewGEP->mutateType(GetElementPtrInst::getGEPReturnType(GEPTy, CurrentV, Operands));


I think this clone mutation may be clearer / simpler if expressed as NewGEP->mutateType(PointerType::getUnqual(cast<PointerType>(GEP->getType())->getElementType(), 0))

Design notes are in the devdocs. Algorithmic documentation in code comments.

Details are in the devdocs. This scheme is significantly simpler.

Some LLVM passes don't handle addrspacecast too well, so try to minimize addrspace cast transitions where legal according to our invariants.

In the previous iteration of this code, timing would be double counted. This way, the individual passes are correctly registered with the top level manager, so timing works properly. It's a bit hacky but with the legacy pass manager, there's not much else to be done.

Also avoid numbering arguments early in the pipeline. Improves performance on small test cases without safepoints.

Keno · 2017-06-15T18:45:09Z

I've squashed some of the cleanup commits in the middle and addressed the latest round of feedback and comments. There shouldn't really be any changes to the generated code, but I'll let CI run through just in case (also in case there are interactions with master). I do plan to merge this as soon as CI is through though.

vtjnash · 2017-06-15T19:23:13Z

src/llvm-propagate-addrspaces.cpp

+        GetElementPtrInst *NewGEP = cast<GetElementPtrInst>(GEP->clone());
+        ToInsert.push_back(std::make_pair(NewGEP, GEP));
+        Type *GEPTy = GEP->getSourceElementType();
+        Type *NewRetTy = cast<PointerType>(GEP->getType())->getElementType()->getPointerTo(getValueAddrSpace(CurrentV));


If you're putting an addrspace on this, there should be one on the BitCast below too, otherwise the BitCast inst would be attempting to change AddrSpace also.

In practice the addrspace is always 0 (on non-GPU targets anyway), so it's fine, but sure, I'll add the addrspace to the bitcast as well.

After merging that is. Otherwise, by the time we're through the CI queue again I'll have to rebase yet again.

tkelman

few minor typos etc, can be addressed in either a ci skip commit or later followup

tkelman · 2017-06-16T08:56:53Z

doc/src/devdocs/llvm.md

+jlcall calling convention. This allows us to retain the SSA-ness of the
+uses throughout the optimizer. GC root placement will later lower this call to
+the original C ABI. In the code the calling convention number is represented by
+the `JLCALL_F_CC` constant. In addition, there ist the `JLCALL_CC` calling


there is the

tkelman · 2017-06-16T08:57:24Z

doc/src/devdocs/llvm.md

+
+## GC root placement
+
+GC root placement is done by an LLVM late in the pass pipeline. Doing GC root


an LLVM pass late in the pipeline

tkelman · 2017-06-16T08:57:55Z

doc/src/devdocs/llvm.md

+placement this late enables LLVM to make more aggressive optimizations around
+code that requires GC roots, as well as allowing us to reduce the number of
+required GC roots and GC root store operations (since LLVM doesn't understand
+our GC, it wouldn't otherwise know what it is and is not allowed to do with


not allowed to do what?

tkelman · 2017-06-16T08:59:57Z

doc/src/devdocs/llvm.md

+to always be discardable without altering the semantics of the program. However,
+failing to identify a gc-tracked pointer alters the resulting program behavior
+dramatically - it'll probably crash or return wrong results. We currently use
+three different addressspaces (their numbers are defined in src/codegen_shared.cpp):


address spaces (used as 2 words in the rest of this paragraph)

tkelman · 2017-06-16T09:02:32Z

doc/src/devdocs/llvm.md

+
+First, only the following addressspace casts are allowed
+- 0->{Tracked,Derived,CalleeRooted}: It is allowable to decay an untracked pointer to any of the
+  other. However, do note that the optimizer has broad license to not root


tkelman · 2017-06-16T10:14:49Z

src/llvm-late-gc-lowering.cpp

+
+    4. GC Root coloring
+
+      Two values which are not simulataneously live at a safepoint can share the


simultaneously

tkelman · 2017-06-16T10:15:09Z

src/llvm-late-gc-lowering.cpp

+      Two values which are not simulataneously live at a safepoint can share the
+      same slot. This is an important optimization, because otherwise long
+      functions would have exceptionally large GC slots, reducing performance
+      and bloating the size of the stack. Assigning values to these slots is,


no comma after is
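
The slot sharing described in the quoted comment can be illustrated with greedy coloring of an interference graph. This is a sketch of that general technique, not a description of what the pass does: the quoted excerpt does not say which coloring strategy is used, and `LiveSet`, `interferes`, `colorRoots`, and `numSlots` are hypothetical names. It assumes each value comes with the set of safepoint ids at which it is live.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical input: for each value, the safepoint ids at which it is live.
using LiveSet = std::set<int>;

// Two values interfere if they are live at a common safepoint.
bool interferes(const LiveSet &a, const LiveSet &b)
{
    for (int sp : a)
        if (b.count(sp))
            return true;
    return false;
}

// Greedy coloring: give each value the lowest-numbered slot not already taken
// by an interfering value processed earlier. Returns one slot index per value.
std::vector<int> colorRoots(const std::vector<LiveSet> &liveAt)
{
    std::vector<int> slot(liveAt.size(), -1);
    for (std::size_t i = 0; i < liveAt.size(); ++i) {
        std::set<int> taken;
        for (std::size_t j = 0; j < i; ++j)
            if (interferes(liveAt[i], liveAt[j]))
                taken.insert(slot[j]);
        int c = 0;
        while (taken.count(c))
            ++c;
        slot[i] = c;
    }
    return slot;
}

// Number of distinct slots the frame needs for a given assignment.
int numSlots(const std::vector<int> &slots)
{
    int maxSlot = -1;
    for (int s : slots) {
        assert(s >= 0); // every value must have been assigned a slot
        if (s > maxSlot)
            maxSlot = s;
    }
    return maxSlot + 1;
}
```

For three values live at safepoints {1, 2}, {3}, and {2, 3}, the first two never overlap and share slot 0, while the third interferes with both and gets slot 1, so the frame needs two slots instead of three.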

tkelman · 2017-06-16T10:17:40Z

src/llvm-late-gc-lowering.cpp

+      (this is beneficial for code where the primary path does not have
+      safepoints, but some other path - e.g. the error path does). However,
+      if the first safepoint is not dominated by the definition (this can
+      happen due to the non-ssa corner cases), the store is insert right after


is inserted

tkelman · 2017-06-16T10:18:29Z

src/llvm-late-gc-lowering.cpp

+      of the algorithm and their operands as uses of those values. It is
+      important to consider however WHERE the uses of PHI's operands are
+      located. It is neither at the start of the basic block, because the values
+      do not dominated the block (so can't really consider them live-in), nor


do not dominate

tkelman · 2017-06-16T10:20:05Z

src/llvm-late-gc-lowering.cpp

+                        %Bbase, %B]
+
+      We then pretend, for the purposes of numbering that %phi was derived from
+      %philift. Note that in order to be able to this, we need to be able to


in order to be able to do this,

Keno · 2017-06-16T22:10:11Z

Will do various cleanups in a followup PR.

As of the merge of #21888, we no longer support building on LLVM <= 3.9.1. This is essentially the mechanical cleanup to rip out all code that we had to support building on older LLVM versions. There may still be some residual support left in places and of course some things can now be cleaned up further, but this should get us started.

Keno force-pushed the kf/gcroots branch from 3e967a8 to cd0a775 Compare May 15, 2017 14:58

JeffBezanson added compiler:codegen Generation of LLVM IR and native code performance Must go faster labels May 15, 2017

vtjnash reviewed May 15, 2017

Keno force-pushed the kf/gcroots branch from cd0a775 to d4b9f3a Compare May 15, 2017 17:55

Keno mentioned this pull request May 16, 2017

conservative stack scanning? #11714

Closed

StefanKarpinski reviewed May 16, 2017

Keno force-pushed the kf/gcroots branch 2 times, most recently from 8d19b38 to eb49b3d Compare May 19, 2017 23:23

Keno force-pushed the kf/gcroots branch from eb49b3d to b9d111b Compare May 20, 2017 15:56

Keno mentioned this pull request May 22, 2017

Add LLVM patches for new GC rooting #22022

Merged

vtjnash reviewed Jun 15, 2017

Keno and others added 7 commits June 15, 2017 14:19

Introduce new GC root placement pass

7227007

Design notes are in the devdocs. Algorithmic documentation in code comments.

Remove old GC placement pass

3a9fa43

Remove unnecessary MaybeNotePhiJLCallFrameUses

cff6a5c

Change IR representation of jlcall frames

eefbde7

Details are in the devdocs. This scheme is significantly simpler.

Add a pass to propagate addrspace information

226bca3

Some LLVM passes don't handle addrspacecast too well, so try to minimize addrspace cast transitions where legal according to our invariants.

Don't color roots that don't need it

b0a162c

Also avoid numbering arguments early in the pipeline. Improves performance on small test cases without safepoints.

Keno force-pushed the kf/gcroots branch from 693747a to b0a162c Compare June 15, 2017 18:43

vtjnash reviewed Jun 15, 2017

timholy mentioned this pull request Jun 16, 2017

Should DomainErrors accept arguments? #12152

Closed

tkelman reviewed Jun 16, 2017

Keno merged commit 42e1f63 into master Jun 16, 2017

ararslan deleted the kf/gcroots branch June 16, 2017 23:05

This was referenced Jun 16, 2017

Silence Clang warnings for GC root placement pass #22395

Merged

Minor formatting fixes for the LLVM devdocs #22397

Merged

Keno mentioned this pull request Jun 17, 2017

Drop support for LLVM <= 3.9.1 #22401

Merged

This was referenced Jun 17, 2017

LLVM 4.0 patch list #22410

Merged

Reuse gcframe for jlcall #22417

Open

maleadt mentioned this pull request Jun 20, 2017

NVPTX does not support Julia's address spaces JuliaGPU/CUDAnative.jl#73

Closed

timholy mentioned this pull request Jun 29, 2017

Want non-allocating array views #14955

Closed

vtjnash mentioned this pull request Sep 12, 2017

try to unreference objects sooner in generated code #8393

Closed

vchuravy mentioned this pull request Feb 27, 2018

LLVM assertion error on PPC #26221

Closed

