Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Refactor init/copy block codegen #27035

Merged
merged 5 commits into from
Oct 16, 2019
Merged

Refactor init/copy block codegen #27035

merged 5 commits into from
Oct 16, 2019

Conversation

mikedn
Copy link

@mikedn mikedn commented Oct 4, 2019

Extracted from #21711 for easier handling.

A bunch of cleanup to make the init/copy block codegen code more consistent, readable and easier to change. One notable change is that GenTreeAddrMode is no longer a large node, it didn't need be. This simplifies things a bit because the original ADD node can be directly turned into a LEA node, instead of creating a new node and having to find the use and update it.

// object). This is not a "regular" indirection.
// (Note that the user check could be costly.)
LIR::Use indirUse;
if (BlockRange().TryGetUse(indir, &indirUse) && indirUse.User()->OperIsIndir())
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The removal of this code is responsible for the sole diff in this PR:

--- D:\Projects\jit\diffs\dasmset_16\base\System.Private.CoreLib.dasm	2019-10-04 20:51:01.000000000 +-0300
+++ D:\Projects\jit\diffs\dasmset_16\diff\System.Private.CoreLib.dasm	2019-10-04 20:51:01.000000000 +-0300
@@ -439051,6 +439051,5 @@
 G_M6933_IG02:
-       add      rdx, r8
-       movups   xmm0, xmmword ptr [rdx]
+       movups   xmm0, xmmword ptr [rdx+r8]
        movups   xmmword ptr [rcx], xmm0
        mov      rax, rcx

ContainCheckHWIntrinsicAddr calls TryCreateAddrMode with isIndir = true but then TryCreateAddrMode decides that it knows better...


dispIns(id);
emitCurIGsize += sz;
emitIns_R_ARX(ins, attr, reg, base, REG_NA, 1, disp);
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A ton of duplicated code...

}
blkNode->gtBlkOpKind = GenTreeBlk::BlkOpKindUnroll;

src->AsIntCon()->SetIconValue(fill);
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The value wasn't updated when it was 0, even though the value that was checked for wasn't the original value but a masked one...

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch, so if src->AsIntCon()->IconValue() == 0x11100 we would get a wrong init?
We should be able to repro it with CEE_INITBLK il.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Umm, I changed because it simply was confusing - lowering was masking and checking for 0 and codegen was NOT masking but also checking for 0. But yes, it turns out that it is a real bug on ARM:

        52822000          movz    w0, #0x1100
        72A00020          movk    w0, #1 LSL #16
        9101C3A1          add     x1, fp, #112  // [V05 loc0]
        B9000020          str     w0, [x1]
        39001020          strb    w0, [x1,#4]

XARCH lowering was correct but also did things differently - masking, multiplying & updating, checking for 0.

{
initVal->gtIntCon.gtIconVal = 0x01010101 * fill;
fill *= 0x0101010101010101LL;
src->gtType = TYP_LONG;
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes the type of the constant but doesn't change the type of GT_INIT_VAL. Doesn't seem to have any ill side effects so I left it unchanged. Anyway I think that the type of GT_INIT_VAL should really be TYP_STRUCT.

{
initVal->gtIntCon.gtIconVal = 0x01010101 * fill;
fill *= 0x0101010101010101LL;
src->gtType = TYP_LONG;
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the type was changed to long even when the fill was 0. That isn't necessary and resulted in xor rax, rax rather than xor eax, eax being generated in disassembly.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you see diffs from changing the handling of 0? Seems like that would be pretty common

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are only textual diffs. There are hacks in the emitter that prevented a 64 bit xor from being emitted in the first place, like here:

coreclr/src/jit/emitxarch.cpp

Lines 11360 to 11366 in 088299e

case EA_8BYTE:
// TODO-AMD64-CQ: Better way to not emit REX.W when we don't need it
// Don't need to zero out the high bits explicitly
if ((ins != INS_xor) || (reg1 != reg2))
{
code = AddRexWPrefix(ins, code);
}

// so we need to check if there's a long enough sequence of non-GC slots.
ClassLayout* layout = blkNode->GetLayout();
unsigned slots = layout->GetSlotCount();
for (unsigned i = 0; i < slots; i++)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why use 3 loops when you can do it with one?!

addr->ClearContained();
}
}
else if (!source->IsMultiRegCall() && !source->OperIsSimdOrHWintrinsic())
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The multireg call/simd/hw intrinsic does not ever happen.

@BruceForstall
Copy link
Member

@dotnet/jit-contrib

Copy link

@sandreenko sandreenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for extracting this out of #21711 .
Some questions about the first 8 commits here, will return later.

I think you could speed up review/merge of these changes if cut them on even smaller parts (for example 7ed454a could be merged right now if you extract it into a separate PR, 9ab3434 also could be merged and it doesn't look like part of "Refactor init/copy block codegen").

src/jit/emitxarch.cpp Outdated Show resolved Hide resolved

assert(dstAddr->isUsedFromReg());

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why have you decided to delete these asserts?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

genConsumeReg eventually asserts if there's no register assigned to the node

assert(tree->gtHasReg());

So the assert is redundant. In general the JIT codebase seems to be having a problem with assert redundancy.

src/jit/codegenarmarch.cpp Outdated Show resolved Hide resolved
src/jit/codegenarmarch.cpp Outdated Show resolved Hide resolved
src/jit/codegenarmarch.cpp Outdated Show resolved Hide resolved
src/jit/lowerxarch.cpp Show resolved Hide resolved
src/jit/lowerxarch.cpp Show resolved Hide resolved
src/jit/lsraarmarch.cpp Show resolved Hide resolved
src/jit/lsraarmarch.cpp Show resolved Hide resolved
}
#endif
buildInternalIntRegisterDefForNode(blkNode, regMask);
}

if (size >= XMM_REGSIZE_BYTES)
{
// If we have a buffer larger than XMM_REGSIZE_BYTES,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Were these comments wrong or unnecessary?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The commenting throughout the JIT codebase is in general pretty terrible. The first rule about commenting is that comments should explain why the code does something when that isn't obvious and NOT what the code does. What the code does should be obvious from, well, the code itself.

Now, in complex code it's not always possible to write clear enough code. Or sometimes the code is clear but is large and saying a few words about what it does is helpful because it speeds up understanding of the code.

That's not the case here. It's 3 lines of trivial code and 5 lines of comments.

@mikedn
Copy link
Author

mikedn commented Oct 8, 2019

I think you could speed up review/merge of these changes if cut them on even smaller parts

Sorry about this monster PR. Initially I was just looking to get the destination address of init block contained because that was already done for copy block and was missing from init block for no obvious reason.

Unfortunately lowering, lsra build & codegen are rather tighly coupled so I had to investigate what needs changing and then things spiralled out of control. Unused variables, confusing names, duplicated code, code inconsistencies for now reason and whatnot.

@mikedn
Copy link
Author

mikedn commented Oct 8, 2019

I think you could speed up review/merge of these changes if cut them on even smaller parts

BTW - it's probably not too late to do that. There are some 3 commits near the end that would definitely fit well in 2 separate PRs. Preferences?

@sandreenko
Copy link

BTW - it's probably not too late to do that. There are some 3 commits near the end that would definitely fit well in 2 separate PRs. Preferences?

It would be nice, in additional I think it is useful to get independent ci testing for each of them to make sure that they don't hide regressions from each others.

Copy link

@sandreenko sandreenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few more questions/comments.

}
blkNode->gtBlkOpKind = GenTreeBlk::BlkOpKindUnroll;

src->AsIntCon()->SetIconValue(fill);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch, so if src->AsIntCon()->IconValue() == 0x11100 we would get a wrong init?
We should be able to repro it with CEE_INITBLK il.


#ifdef DEBUG
// CpObj must always have at least one GC-Pointer as a member.
assert(objNode->GetLayout()->HasGCPtr());

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, before we were checking:

if (blkNode->OperGet() == GT_STORE_OBJ)
  assert(blkNode->AsObj()->GetLayout()->HasGCPtr());

now there is

if (blkNode->OperIs(GT_STORE_OBJ) && (!blkNode->AsObj()->GetLayout()->HasGCPtr() || blkNode->gtBlkOpGcUnsafe))

so if before all GT_STORE_OBJ had at least one GCPtr then who passes !blkNode->AsObj()->GetLayout()->HasGCPtr() check after that change?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, nice catch. This change shouldn't probably be in this PR, though it's small and doesn't hurt.

One of the removed commits (mikedn@8336b44) is moving the code from fgMorphUnsafeBlk to lowering. That's responsible for changing OBJs without GC pointers into BLKs.

I'm also working on #27053 that will make BLKs even more rare. Since introducing ClassLayout and making OBJ a small node there's no longer any need to discard struct type information from IR. We should use OBJ whenever type information is available, with or without GC pointers.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it's still the case that after morph we have no GT_OBJ nodes that have zero gc pointers. So I'm not sure why this would change.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, currently GT_OBJ nodes always have GC pointers. The plan is to use GT_OBJ nodes even in cases when no GC pointers are present, at least until lowering. After lowering it probably doesn't matter anymore.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense - in that case perhaps we don't need to change it.

src/jit/lowerarmarch.cpp Outdated Show resolved Hide resolved
// b) The size of the struct to initialize is smaller than INITBLK_UNROLL_LIMIT bytes.
void CodeGen::genCodeForInitBlkUnroll(GenTreeBlk* initBlkNode)
{
#ifndef _TARGET_ARM64_

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not #ifdef _TARGET_ARM_?
are you going to implement this function later?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the ifdef will just disappear in #21711. Besides, I find it convenient to place the small block before the big block when you some sort of if/else construct. You don't need to scroll through the entire big block to see what the small block is doing.

Copy link

@CarolEidt CarolEidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of questions and some minor comment nits.

{
dstAddrBaseReg = genConsumeReg(dstAddr);

// TODO-Cleanup: This matches the old code behavior in order to get 0 diffs. It's otherwise

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this something you're planning to change? Should there be an issue?
Also, it should be TODO-ARMArch.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this will be fixed in the PR from where all this cleanup was extracted from: 1004804

src/jit/codegenxarch.cpp Outdated Show resolved Hide resolved
src/jit/codegenxarch.cpp Outdated Show resolved Hide resolved

#ifdef DEBUG
// CpObj must always have at least one GC-Pointer as a member.
assert(objNode->GetLayout()->HasGCPtr());

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it's still the case that after morph we have no GT_OBJ nodes that have zero gc pointers. So I'm not sure why this would change.

src/jit/lowerarmarch.cpp Outdated Show resolved Hide resolved

size_t classSize = objNode->GetLayout()->GetSize();
size_t blkSize = roundUp(classSize, TARGET_POINTER_SIZE);
if (blkNode->OperIs(GT_STORE_OBJ) && (!blkNode->AsObj()->GetLayout()->HasGCPtr() || blkNode->gtBlkOpGcUnsafe))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that this check was here before. It should probably be asserting !blkNode->AsObj()->GetLayout()->HasGCPtr(), but perhaps that's best left for another PR.

{
// TODO-CQ: We could unroll even when the initialization value is not a constant

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: TODO-XArch-CQ

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this is in lowerxarch it is not arch specific, you can always insert a mul x, 0x01010101 to deal with non-const initialization. Not sure if anyone is doing this, though since initblk was exposed in the Unsafe class perhaps someone will complain about it.

{
initVal->gtIntCon.gtIconVal = 0x01010101 * fill;
fill *= 0x0101010101010101LL;
src->gtType = TYP_LONG;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you see diffs from changing the handling of 0? Seems like that would be pretty common

@mikedn
Copy link
Author

mikedn commented Oct 10, 2019

I haven't rebased this yet in case there is more feedback.

Copy link

@CarolEidt CarolEidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - thanks!

@sandreenko
Copy link

/azp run outerloop

@azure-pipelines
Copy link

Pull request contains merge conflicts.

@sandreenko
Copy link

Unfortunatle I won't be able to do a review until Monday, could you please do a rebase now so we can check pri1 and stress jobs?

@mikedn
Copy link
Author

mikedn commented Oct 10, 2019

Thanks, conflicts fixed. Just gtRegNum/GetRegNum stuff.

@sandreenko
Copy link

/azp run outerloop

@azure-pipelines
Copy link

No pipelines are associated with this pull request.

@mikedn
Copy link
Author

mikedn commented Oct 10, 2019

Huh, the assert I added to check for address containment fired. I'd say that the assert is in itself correct, what's dubious is the IR that gets produced from this piece of IL:

    [ 0]   0 (0x000) ldloca.s 0
    [ 1]   2 (0x002) ldarg.0
    [ 2]   3 (0x003) ldflda 04000002
    [ 2]   8 (0x008) ldc.i4.s 28
    [ 3]  10 (0x00a) cpblk

is imported as:

               [000007] -A-XG-------              *  ASG       struct (copy)
               [000006] -------N----              +--*  BLK(28)   struct
               [000001] ------------              |  \--*  ADDR      byref 
               [000000] ------------              |     \--*  LCL_VAR   struct V02 loc0         
               [000003] ---XG--N----              \--*  FIELD     int    f2
               [000002] ------------                 \--*  LCL_VAR   byref  V01 arg0         

which eventually ends up as

N001 (  1,  1) [000002] ------------         t2 =    LCL_VAR   byref  V01 arg0         u:1 (last use) $81
N002 (  1,  1) [000014] ------------        t14 =    CNS_INT   long   4 field offset Fseq[f2] $100
                                                  /--*  t2     byref  
                                                  +--*  t14    long   
N003 (  2,  2) [000015] -------N----        t15 = *  ADD       byref  $140
                                                  /--*  t15    byref  
N004 (  4,  4) [000003] *--XG--N----         t3 = *  IND       int    <l:$241, c:$240>
N005 (  3,  2) [000000] U------N----         t0 =    LCL_VAR_ADDR byref  V02 loc0         ud:1->2
                                                  /--*  t0     byref  
                                                  +--*  t3     int    
N007 (  6,  5) [000006] nA-XG-------              *  STORE_BLK(28) struct (copy)

And the indir address gets contained because its type is TYP_INT. IMO we should not end up with such a struct/block-op/primitive mix. The importer (and possibly block copy morph) should be fixed.

Interestingly, I've run into a somewhat related issue with initobj recently - that gets blindly imported as BLK, even if the class handle is a primitive or reference type. And attempting to generate an OBJ instead of BLK like I'm trying to do in another PR results in the creation of a ClassLayout object for a ref type. Fun.

@mikedn
Copy link
Author

mikedn commented Oct 11, 2019

However, I still think that at least the first 4 commits are independent enough to be merged separately in 4 independent PRs

Sure, I've no problem splitting this even further. The main reason the original PR got so big is that for a long time nothing was merged and it would have been difficult for me to keep multiple PRs coordinated. Not to mention that the CI just isn't reliable, more builds -> more chance of random failures.

@sandreenko
Copy link

Sure, I've no problem splitting this even further. The main reason the original PR got so big is that for a long time nothing was merged and it would have been difficult for me to keep multiple PRs coordinated. Not to mention that the CI just isn't reliable, more builds -> more chance of random failures.

That is true, sorry about that, but we are trying to do these things better now...

Normally 0 sized block ops should be simply removed during lowering. But these are rare enough that adding special code to deal with them is questionable.

Instead unroll such block ops, the unroll codegen simply won't generate any code for a 0 sized block op. Of course, some registers will still be allocated.
Non-GC GT_STORE_OBJ nodes were changed to GT_STORE_BLK in block copy ops
but not in block init ops. Since block init is effectvely non-GC it is
preferable to be consistent and change to GT_STORE_BLK in both cases.
@sandreenko
Copy link

/azp list

@azure-pipelines
Copy link

@sandreenko
Copy link

/azp run coreclr-outerloop

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link

@CarolEidt CarolEidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one question about removing a TODO comment


if (blkNode->gtBlkOpKind == GenTreeBlk::BlkOpKindUnroll)
{
// TODO-ARM-CQ: Currently we generate a helper call for every

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason you removed this TODO?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The NYI_ARM bellow should suffice. And anyway I'll enable unroll on ARM in the "improve" PR: d871e37

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I can live with that, though believe it or not, I do occasionally grep for the TODOs.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have that too but in lowering:

#endif // _TARGET_ARM64_
{
// TODO-ARM-CQ: Currently we generate a helper call for every initblk we encounter.
// Later on we should implement loop unrolling code sequences to improve CQ.
blkNode->gtBlkOpKind = GenTreeBlk::BlkOpKindHelper;
}

Copy link

@sandreenko sandreenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks again for splitting that.

Note: the stress testing looks greener than I expected it (no infra/timeouts)

@sandreenko
Copy link

Is it worth recollecting diffs now?

@mikedn
Copy link
Author

mikedn commented Oct 16, 2019

I've ran x64/x86 and altjit arm32/arm64 diffs again - none. The one I mentioned in the first comment in this issue was coming from one of commits that was removed from this PR.

Looking through failed jitstress tests, looks like stuff with a history...

@sandreenko
Copy link

I've ran x64/x86 and altjit arm32/arm64 diffs again - none. The one I mentioned in the first comment in this issue was coming from one of commits that was removed from this PR.

I expected that Use GT_STORE_BLK/unroll consistently and Treat 0 sized block ops as unrolled commits could generate diffs. Doesn't it just happen in framework libraries or am I wrong that they could?

@mikedn
Copy link
Author

mikedn commented Oct 16, 2019

Use GT_STORE_BLK/unroll consistently

Hmm, why? The inconsistency was that init block codegen accepted both BLK and OBJ and generated the same code in both cases while copy block also accepted both but generated different code. The change simply makes init block uses only BLK.

Treat 0 sized block ops as unrolled

Yes, that would generate diffs if you run into such a case. Let me look for a test that does this, I'm pretty sure there's at least one.

@sandreenko
Copy link

Hmm, why? The inconsistency was that init block codegen accepted both BLK and OBJ and generated the same code in both cases while copy block also accepted both but generated different code. The change simply makes init block uses only BLK.

Got it, I had the impression that we were calling emitDisableGC in init block for OBJ. Thanks.

@mikedn
Copy link
Author

mikedn commented Oct 16, 2019

Here's a 0 sized block test that @CarolEidt recently added while fixing a bug: https://github.com/dotnet/coreclr/blob/master/tests/src/JIT/Regression/JitBlue/GitHub_24846/GitHub_24846.cs

TestCopy produces the best diff:

--- D:\Projects\jit\diffs\dasmset_1\base\test.dasm	2019-10-16 21:18:05.000000000 +-0300
+++ D:\Projects\jit\diffs\dasmset_1\diff\test.dasm	2019-10-16 21:18:05.000000000 +-0300
@@ -15,14 +15,4 @@
 G_M38074_IG01:
-       push     rdi
-       push     rsi
-       mov      rdi, rcx
 
 G_M38074_IG02:
-       mov      rsi, rdx
-       xor      ecx, ecx
-       rep movsb 
 
-G_M38074_IG03:
-       pop      rsi
-       pop      rdi
        ret      

Since unrolled block copy does not place any restrictions on the source and destination registers the JIT doesn't even need to move the registers so no code is being generated.

The block init case is more interesting - you can see how registers are still getting allocated even if no actual initialization code is generated:

--- D:\Projects\jit\diffs\dasmset_1\base\test.dasm	2019-10-16 21:18:05.000000000 +-0300
+++ D:\Projects\jit\diffs\dasmset_1\diff\test.dasm	2019-10-16 21:18:05.000000000 +-0300
@@ -160,6 +145,4 @@
 G_M7755_IG05:
-       xor      rsi, rsi
-       mov      rdi, qword ptr [rbp-20H]
-       xor      ecx, ecx
-       rep movsb 
+       xor      rax, rax
+       mov      rdx, qword ptr [rbp-18H]

At least the xor could probably be removed by making the initialization value contained - since no initialization is done a register is not needed. But that's just more code to handle a very rare case.

@mikedn
Copy link
Author

mikedn commented Oct 16, 2019

Got it, I had the impression that we were calling emitDisableGC in init block for OBJ. Thanks.

Something like that actually happens but it's a different issue and the fix is in one of the removed commits. Anyway, it doesn't result in actual code differences, you'll only see less labels in diffs due to the GC non-interruptible IGs created by the emitter for disable/enable GC.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants