Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The GC root placement pass 1.0 deserves #21888

Merged
merged 7 commits into from
Jun 16, 2017
Merged

The GC root placement pass 1.0 deserves #21888

merged 7 commits into from
Jun 16, 2017

Conversation

Keno
Copy link
Member

@Keno Keno commented May 15, 2017

Item 1/2 on my 1.0 must have list:

This is a principled rewrite of our GC root placement pass to occur late in the pass pipeline rather than early as well as produce a more optimal placement of GC roots. Additionally, it will allow us to perform much more general transformations at the LLVM stage, because we no longer have to worry about potentially introducing unrooted values, etc. It also allows the LLVM optimizer itself to do significantly more aggressive optimizations in the presence of boxed objects.

Currently this is written again LLVM 5.0 and does use features that are new to that release, as well as the following LLVM patches:
https://reviews.llvm.org/D32593
https://reviews.llvm.org/D33110
https://reviews.llvm.org/D33129
https://reviews.llvm.org/D33179

I do think however that I can backport this to 4.0, albeit inserted closer the start of the pipeline. That wouldn't give quite as much benefit, but it should be equivalent or slightly better to what we have right now.

I have focused on correctness over compile time performance in the implementation here and I know the algorithm is not optimal (it's fair to say it's pretty naive). Additionally, there is a significant number of follow on optimizations, as well as codegen cleanup that is yet to be done. However, this is already a pretty big change, so I'm saving those for after this is merged.

Design notes are in the devdocs, implementation notes as code comments in the placement pass. There's also tests for all the tricky corner cases I found.

Fixes #20981
Fixes #15369
Supersedes #21158

@timholy
Copy link
Sponsor Member

timholy commented May 15, 2017

Sounds really nice. Does this have any implications for temporary buffer reuse? (For example, if the root can be placed in the caller...) I'm thinking of functions along the lines of

function mymedian(v)
    vc = copy(v)
    sort!(vc)  # just for illustration, this is not optimal...
    inds = linearindices(vc)
    vc[(first(inds)+last(inds)) ÷ 2]
end

for which, if you call it repeatedly with many small vectors, currently exacts a non-trivial GC cost that can be mitigated by pre-allocating a temporary buffer and passing it in as an argument.

@JeffBezanson JeffBezanson added compiler:codegen Generation of LLVM IR and native code performance Must go faster labels May 15, 2017
@@ -2362,13 +2370,15 @@ static void simple_escape_analysis(jl_value_t *expr, bool esc, jl_codectx_t *ctx
// Emit a gc-root slot indicator
static Value *emit_local_root(jl_codectx_t *ctx, jl_varinfo_t *vi)
{
CallInst *newroot = CallInst::Create(prepare_call(gcroot_func), "", /*InsertBefore*/ctx->ptlsStates);
Instruction *newroot = new AllocaInst(T_prjlvalue, 0, "gcroot", /*InsertBefore*/ctx->ptlsStates);
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for performance we will want to do:

#ifndef NDEBUG
newroot->setName("gcroot");
#endif

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably.

src/codegen.cpp Outdated
int slot = 0;
assert(args.size() > 0);
auto lifetime_start = Intrinsic::getDeclaration(jl_Module, Intrinsic::lifetime_start, {T_pprjlvalue});
builder.CreateCall(lifetime_start, {
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LLVM37+ syntax here, but elsewhere you seem to be maintaining LLVM33 support. Seems like the PR should just be consistent about dropping support?

}

JL_FEAT_REQUIRE(ctx, runtime);
Value *varg1 = boxed(arg1, ctx);
Value *varg2 = boxed(arg2, ctx, false); // potentially unrooted!
Value *varg1 = mark_callee_rooted(boxed(arg1, ctx));
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this isn't callee-rooted (and the creation of varg2 could allocate)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine, the pass will still keep it live at any safepoint that it's life across. It just won't put it in a gc root if a safepoint terminates the live interval.

src/codegen.cpp Outdated
@@ -2693,12 +2720,13 @@ static bool emit_builtin_call(jl_cgval_t *ret, jl_value_t *f, jl_value_t **args,
}
if (jl_subtype(ty, (jl_value_t*)jl_type_type)) {
*ret = emit_expr(args[1], ctx);
Value *rt_ty = boxed(emit_expr(args[2], ctx), ctx);
Value *rt_ty = mark_callee_rooted(boxed(emit_expr(args[2], ctx), ctx));
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't callee-rooted unless boxed tells you that it was

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above. The placement pass figures it out. The callee rooted annotation is just to annotate that the only thing you're allowed to do with it is pass it to a function that callee roots.

Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, but type-assert also doesn't root this, so it does need to be rooted at the call safepoint by the caller / codegen

src/codegen.cpp Outdated
retval.V),
retval.gcroot);
builder.CreateStore(box,
retval.gcroot);
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alignment here seems odd. seems like this could fit on the one line

@@ -129,12 +130,16 @@ void addOptimizationPasses(PassManager *PM)
if (jl_options.opt_level == 0) {
PM->add(createCFGSimplificationPass()); // Clean up disgusting code
PM->add(createMemCpyOptPass()); // Remove memcpy / form memset
PM->add(createLowerPTLSPass(imaging_mode));
#if JL_LLVM_VERSION >= 40000
PM->add(createAlwaysInlinerLegacyPass()); // Respect always_inline
#else
PM->add(createAlwaysInlinerPass()); // Respect always_inline
#endif
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we actually confident that this can't break your gc-invariants and end up trying to run that pass on external code that it wasn't supposed to try to run on?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. It only looks at our address spaces. If external code uses our address spaces, it needs to follow our invariants.

@@ -489,8 +501,13 @@ JuliaOJIT::JuliaOJIT(TargetMachine &TM)
addOptimizationPasses(&PM);
}
else {
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we write this as addOptimizationPasses(&PM, /* -O */0), and make the jl_options.opt_level option to that function explicit?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

```
The GC root placement pass will treat the jl_roots operand bundle as if it were
a regular operand. However, as a final step, after the gc roots are inserted,
it will drop the operand bundle to avoid confusing codegen.
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"avoid confusing machine instruction selection"?

@Keno
Copy link
Member Author

Keno commented May 15, 2017

Sounds really nice. Does this have any implications for temporary buffer reuse?

Yes and no. It makes it a lot easier to do that kind of transformation at the LLVM level, as well as making LLVM-level inlining easier. Those two combined could easily achieve what you want in your example. However, by itself this doesn't really do much there.


Minimize the number of needed gc roots/stores to them subject to the constraint
that at every safepoint, any live gc-tracked pointer (i.e. is a path after this
point that contains a use of this pointer).
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incomplete sentence?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, should have is in a gc slot ;).

for the function. As a result, the external rooting must be arranged while the
value is still tracked by the system. I.e. it is not valid to attempt use the
result of this operation to establish a global root - the optimizer may have
already dropped the value.
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make any sense for pointer_from_objref to return a pointer in a GC-aware address space? Would that allow LLVM to automatically make code using such pointers safe(r)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the whole point of that function is to escape the GC.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This escape semantic would also be required so that we can stack allocate RefValue{bitstype} in ccall.


Minimize the number of needed gc roots/stores to them subject to the constraint
that at every safepoint, any live gc-tracked pointer (i.e. is a path after this
point that contains a use of this pointer).
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same incomplete sentence.

This step performs necessary cleanup before passing the IR to codegen. In
particular, it removes any calls to julia_from_objref intrinsics and
removes the extra operand bundles from ccalls. In the future it could
also strip the addressspace information from all values as this
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think "address space" is one word

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well the llvm annotation is addrspace, so I should probably just write that.

@StefanKarpinski
Copy link
Sponsor Member

Great documentation and comments, btw. This is the standard we should all strive for :)

@Keno Keno force-pushed the kf/gcroots branch 2 times, most recently from 8d19b38 to eb49b3d Compare May 19, 2017 23:23
@Keno
Copy link
Member Author

Keno commented May 19, 2017

Rebased and backported to 3.9.1. On 3.9.1, it needs to run very early in the pipeline, so the improvements are not nearly as large. So I guess just think that that as an incentive to move to 5.0 post haste. I don't intend to support any prior LLVM version.

@StefanKarpinski
Copy link
Sponsor Member

We should definitely move master to LLVM 5 ASAP since we know that causes all sorts of havoc and needs a long time to settle into stability.

@timholy
Copy link
Sponsor Member

timholy commented May 20, 2017

It's also a good thing to do this before we introduce too many other incompatibilities: the package ecosystem is a good test bed for shaking out LLVM bugs, but most people won't update packages to 0.7 until 0.7 is looking imminent. Consequently, the longer we wait, the more packages will be untestable for reasons unrelated to the LLVM update.

@Keno
Copy link
Member Author

Keno commented May 22, 2017

Fixed the 32bit build. The windows build seems to be failing because the LLVM patches I added here aren't being applied. @tkelman how do we rebuild the windows CI LLVM binaries?

@tkelman
Copy link
Contributor

tkelman commented May 22, 2017

I have to build them by hand, though now that we build llvm as a shared library we could maybe get them from the nightlies (with some fiddling needed for headers?)

@Keno
Copy link
Member Author

Keno commented May 22, 2017

Not sure what's up with the OS X build. Passes locally. I suspect it'll also need an LLVM version bump.

@tkelman
Copy link
Contributor

tkelman commented May 22, 2017

OS X travis gets its llvm binary from the juliadeps homebrew tap

@Keno
Copy link
Member Author

Keno commented May 25, 2017

@tkelman @staticfloat the patches here have been merged on master. Can we build new copies of LLVM for Win/OSX from the patchset that's currently on master?

@staticfloat
Copy link
Sponsor Member

@Keno can you open a PR to include the patches in this file?

@Keno
Copy link
Member Author

Keno commented May 25, 2017

@staticfloat let me know when the bottles are built so I can kick off CI again here.

@staticfloat
Copy link
Sponsor Member

Hmmm. Compilation failed:

/tmp/llvm39-julia-20170525-79976-1eabd2f/llvm-3.9.1.src/tools/lld/include/lld/Core/Parallel.h:125:14: error: no member named 'thread' in namespace 'std'
        std::thread([=] {
        ~~~~~^
/tmp/llvm39-julia-20170525-79976-1eabd2f/llvm-3.9.1.src/tools/lld/include/lld/Core/Parallel.h:123:10: error: no member named 'thread' in namespace 'std'
    std::thread([&, threadCount] {

Is that coming from one of our patches? This is being built with Clang: 7.0 build 700.

@Keno
Copy link
Member Author

Keno commented May 25, 2017

llvm-3.9.0_threads perhaps?

@staticfloat
Copy link
Sponsor Member

Ah, I see what's going on. We have further patches we need to include to get rid of the std::thread stuff within lld because the llvm3.9 formula installs it by default.

Is there any reason I should get lld compiling properly, or should I just remove it from the subprojects that get built?

@vtjnash
Copy link
Sponsor Member

vtjnash commented Jun 14, 2017

That's starting to look much better. Still a couple left to fix, though. I know this is a pain to rebase, but please don't merge too quickly. There's many new commits (and llvm passes) on here since it's been last reviewed.

CI->setMetadata(MD.first, MD.second);
CI->setDebugLoc(MI.getDebugLoc());
#endif
ToInsert.push_back(std::make_pair(CI, &MI));
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indentation

for (const auto &MD : TheMDs)
CI->setMetadata(MD.first, MD.second);
CI->setDebugLoc(MTI.getDebugLoc());
#endif
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indentation

Value *TheFn = Intrinsic::getDeclaration(MTI.getModule(), MTI.getIntrinsicID(),
{Dest->getType(), Src->getType(),
MTI.getOperand(2)->getType()});
CallInst *CI = CallInst::Create(TheFn, {Dest, Src,
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'll be better to just do MTI->setCalledFunction(TheFn); setArgOperand(0, Dest); setArgOperand(1, Src);. That way, there's fewer things that need to be moved over to get wrong / outdated.

http://llvm.org/docs/doxygen/html/classllvm_1_1CallInst.html#a61f747edd1ca001427e02e02a320b709

for (;it != GEP->op_end(); ++it) {
Operands.push_back(*it);
}
NewGEP->mutateType(GetElementPtrInst::getGEPReturnType(GEPTy, CurrentV, Operands));
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this clone mutation may be clearer / simpler if expressed as NewGEP->mutateType(PointerType::getUnqual(cast<PointerType>(GEP->getType())->getElementType(), 0))

Keno and others added 7 commits June 15, 2017 14:19
Design notes are in the devdocs. Algorithmic documentation in code comments.
Details are in the devdocs. This scheme is signfificantly simpler.
Some LLVM passes don't handle addrspacecast too well, so try to
minimize addrspace cast transitions where legal according to
our invariants.
In the previous iteration of this code, timing would be double counted.
This way, the individual passes are correctly registered with the
top level manager, so timing works properly. It's a bit hacky but
with the legacy pass manager, there's not much else to be done.
Also avoid numbering arguments early in the pipeline. Improves
performance on small test cases without safepoints.
@Keno
Copy link
Member Author

Keno commented Jun 15, 2017

I've squashed some of the cleanup commits in the middle, addressed the latest round of feedback and addressed the latest rounds of comments. There shouldn't really be any changes to the generated code, but I'll let CI run through just in case (also in case there's interactions with master). I do plan to merge this as soon as CI is through though.

GetElementPtrInst *NewGEP = cast<GetElementPtrInst>(GEP->clone());
ToInsert.push_back(std::make_pair(NewGEP, GEP));
Type *GEPTy = GEP->getSourceElementType();
Type *NewRetTy = cast<PointerType>(GEP->getType())->getElementType()->getPointerTo(getValueAddrSpace(CurrentV));
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're putting an addrspace on this, there should be one on the BitCast below too, otherwise the BitCast inst would be attempting to change AddrSpace also.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In practice the adddrspace is always 0 (on non-gpu targets anyway), so it's fine, but sure, I'll add the addrspace to the bitcast as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After merging that is. Otherwise, by the time we're through the CI queue again I'll have to rebase yet again.

Copy link
Contributor

@tkelman tkelman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

few minor typos etc, can be addressed in either a ci skip commit or later followup

jlcall calling convention. This allows us to retain the SSA-ness of the
uses throughout the optimizer. GC root placement will later lower this call to
the original C ABI. In the code the calling convention number is represented by
the `JLCALL_F_CC` constant. In addition, there ist the `JLCALL_CC` calling
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is the


## GC root placement

GC root placement is done by an LLVM late in the pass pipeline. Doing GC root
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

an LLVM pass late in the pipeline

placement this late enables LLVM to make more aggressive optimizations around
code that requires GC roots, as well as allowing us to reduce the number of
required GC roots and GC root store operations (since LLVM doesn't understand
our GC, it wouldn't otherwise know what it is and is not allowed to do with
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not allowed to do what?

to always be discardable without altering the semantics of the program. However,
failing to identify a gc-tracked pointer alters the resulting program behavior
dramatically - it'll probably crash or return wrong results. We currently use
three different addressspaces (their numbers are defined in src/codegen_shared.cpp):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

address spaces (used as 2 words in the rest of this paragraph)


First, only the following addressspace casts are allowed
- 0->{Tracked,Derived,CalleeRooted}: It is allowable to decay an untracked pointer to any of the
other. However, do note that the optimizer has broad license to not root
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

others


4. GC Root coloring

Two values which are not simulataneously live at a safepoint can share the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

simultaneously

Two values which are not simulataneously live at a safepoint can share the
same slot. This is an important optimization, because otherwise long
functions would have exceptionally large GC slots, reducing performance
and bloating the size of the stack. Assigning values to these slots is,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no comma after is

(this is beneficial for code where the primary path does not have
safepoints, but some other path - e.g. the error path does). However,
if the first safepoint is not dominated by the definition (this can
happen due to the non-ssa corner cases), the store is insert right after
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is inserted

of the algorithm and their operands as uses of those values. It is
important to consider however WHERE the uses of PHI's operands are
located. It is neither at the start of the basic block, because the values
do not dominated the block (so can't really consider them live-in), nor
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do not dominate

%Bbase, %B]

We then pretend, for the purposes of numbering that %phi was derived from
%philift. Note that in order to be able to this, we need to be able to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in order to be able to do this,

@Keno Keno merged commit 42e1f63 into master Jun 16, 2017
@Keno
Copy link
Member Author

Keno commented Jun 16, 2017

Will do various cleanups in a followup PR.

@ararslan ararslan deleted the kf/gcroots branch June 16, 2017 23:05
Keno added a commit that referenced this pull request Jun 17, 2017
As of the merge of #21888, we no longer support building on LLVM <= 3.9.1.
This is essentialyl the mechanical cleanup to rip out all code that
we had to support building on older LLVM versions. There may still be
some residual support left in places and of course some things can now
be cleaned up further, but this should get us started.
Keno added a commit that referenced this pull request Jun 17, 2017
As of the merge of #21888, we no longer support building on LLVM <= 3.9.1.
This is essentially the mechanical cleanup to rip out all code that
we had to support building on older LLVM versions. There may still be
some residual support left in places and of course some things can now
be cleaned up further, but this should get us started.
This was referenced Jun 17, 2017
quinnj pushed a commit that referenced this pull request Jun 20, 2017
As of the merge of #21888, we no longer support building on LLVM <= 3.9.1.
This is essentially the mechanical cleanup to rip out all code that
we had to support building on older LLVM versions. There may still be
some residual support left in places and of course some things can now
be cleaned up further, but this should get us started.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler:codegen Generation of LLVM IR and native code performance Must go faster
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unnecessary gc root slots in generated code Very inefficient GC frame generation
9 participants