Hide 'align' instruction behind jmp #60787

Merged · 24 commits · Nov 18, 2021

Conversation

kunalspathak
Member

Overview

With current loop alignment, align instructions are placed just before the loop start. This can hurt performance if the block holding the align instruction is hot or is part of a nested loop, in which case the processor must fetch and decode the nops. This PR instead places the align instructions behind unconditional jmp instructions, if any exist before the loop being aligned. If no such jmp is present, the align is placed right before the loop start, as is done today.

Here is a sample diff where the align instruction was moved out of IG31 and placed into IG29, after the jmp instruction.

[image: before/after disassembly diff]

I have also added COMPlus_JitHideAlignBehindJmp to turn this feature off for debugging purposes; in Release it is always ON. I have also added a stress mode under which, 50% of the time, we emit INS_BREAKPOINT instead of align whenever the align instructions are placed behind a jmp, to verify that they are never executed.
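
Illustratively, the stress decision boils down to something like the following sketch (chooseAlignIns and the hiddenBehindJmp parameter are invented names purely for illustration; the real logic lives in the emitter):

#include <cstdlib>

enum instruction { INS_align, INS_BREAKPOINT };

// Hypothetical sketch, not the actual emitter code: under the stress mode,
// an align site hidden behind an unconditional jmp is emitted as a
// breakpoint 50% of the time, so it faults loudly if it is ever executed.
instruction chooseAlignIns(bool stressMode, bool hiddenBehindJmp)
{
    if (stressMode && hiddenBehindJmp && (std::rand() % 2 == 0))
    {
        return INS_BREAKPOINT; // must never execute
    }
    return INS_align; // normal padding (a nop sequence)
}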

Design

A new data structure, alignBlocksList, is created: a linked list of all BasicBlocks that are heads of loops needing alignment. It is built during the final ref counting phase. During codegen, we pull a basic block from alignBlocksList and watch for an unconditional jmp. If we find one, we emit the align instruction there and set the BB flag BBF_LOOP_ALIGN_ADDED, which ensures that if we see more jmps before the actual loop start, we do not add align instructions again. When we reach the point in the flow where the next block is the loop start, we update the targetIG (see below) of the alignInstr. At that point we also check whether we have seen any jmp so far (via BBF_LOOP_ALIGN_ADDED); if not, we emit the align instruction there. Finally, we move to the next BasicBlock in alignBlocksList.
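
The walk can be pictured with this simplified sketch (an invented Block type and emitAlignAt callback stand in for the real code, which works on BasicBlocks, flags, and insGroups):

// Walk the blocks leading up to one loop head from alignBlocksList: hide
// the align behind the first unconditional jmp seen; if none is found by
// the time we reach the loop, emit it just before the loop start.
struct Block
{
    Block* next;
    bool   endsWithUncondJmp;
};

void placeAlignForLoop(Block* cursor, Block* loopHead, void (*emitAlignAt)(Block*))
{
    bool alignAdded = false; // plays the role of BBF_LOOP_ALIGN_ADDED
    for (Block* b = cursor; b != loopHead; b = b->next)
    {
        if (!alignAdded && b->endsWithUncondJmp)
        {
            emitAlignAt(b);    // padding placed after the jmp never executes
            alignAdded = true; // ignore any later jmp before the loop
        }
    }
    if (!alignAdded)
    {
        emitAlignAt(loopHead); // fall back: pad right before the loop start
    }
}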

The instrDescAlign data structure has been updated. Its idaIG field now points to the IG that contains the align instruction; this can be the IG just before the loop, or an earlier IG that ends with a jmp. idaTargetIG holds what idaIG used to: the IG just before the IG containing the loop. It is used when calculating loopSize. Several of the changes are simply about fetching the right field wherever necessary.
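
In outline, the descriptor now looks something like this (a sketch based on the description above; the actual definition lives in the emitter sources):

struct insGroup; // instruction group

struct instrDescAlign
{
    insGroup* idaIG;       // IG containing the align instruction itself;
                           // either just before the loop, or an earlier IG
                           // that ends with an unconditional jmp
    insGroup* idaTargetIG; // IG just before the loop IG (what idaIG used to
                           // point at); used when computing loopSize
};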

The IGF_LOOP_ALIGN flag, which previously lived on the IG just before the loop IG, has been replaced by IGF_HAS_ALIGN, which lives on the IG that contains the align instruction; again, this may or may not be the IG just before the loop IG. Finally, to handle the special scenario where an IG that is part of one loop holds the align instruction for a different IG, the flag IGF_REMOVED_ALIGN is added to record whether the align instructions present in that IG have been removed.
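
Illustratively (the bit values here are invented; only the flag semantics come from the description above):

enum insGroupFlags : unsigned
{
    IGF_HAS_ALIGN     = 0x1, // this IG holds an align instruction, which may
                             // target a loop in a later IG
    IGF_REMOVED_ALIGN = 0x2, // the align instruction(s) in this IG have been
                             // removed
};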

Impact

Ideally, this change would have produced no code size diffs. The diffs on x64 arise because moving the align instruction changes instruction offsets, letting us shorten some jumps, which in turn changes the alignment heuristics' calculations. However, as seen below, the impact is minimal and the number of unchanged methods far exceeds the number of affected methods.

As expected, there is no code size difference on arm64, since we only moved the align instruction around.

| collection | platform | main | PR | diff | diff % | methods regressed | methods improved | methods unchanged |
|---|---|---|---|---|---|---|---|---|
| libraries.pmi | windows.x64 | 577335 | 577170 | -165 | -0.03% | 27 | 40 | 546 |
| benchmarks.run | windows.x64 | 277173 | 277211 | 38 | 0.01% | 19 | 13 | 181 |
| coreclr_tests.pmi | windows.x64 | 188577 | 188598 | 21 | 0.01% | 9 | 7 | 184 |
| aspnet.run | windows.x64 | 113942 | 113995 | 53 | 0.05% | 3 | 3 | 66 |
| benchmarks.run | windows.arm64 | 201028 | 201028 | 0 | 0.00% | 0 | 0 | 132 |
| coreclr_tests.pmi | windows.arm64 | 118012 | 118012 | 0 | 0.00% | 0 | 0 | 234 |
| libraries.pmi | windows.arm64 | 372600 | 372600 | 0 | 0.00% | 0 | 0 | 405 |

Detail diffs: https://gist.github.com/kunalspathak/9cc028b60a2e7aba82308fa1e94951ba

Contributes to #43227

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Oct 22, 2021
@ghost

ghost commented Oct 22, 2021

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.


@kunalspathak
Member Author

@dotnet/jit-contrib

@BruceForstall BruceForstall self-requested a review October 23, 2021 00:02
Member

@BruceForstall BruceForstall left a comment


Some additional questions/comments:

  • Is there a measured perf difference? Or is this speculative, and we'll see what the lab shows?
  • What if there are many "jmp" before the loop to align? Maybe the first one is a poor choice because it is in a hot loop. Or maybe the only one prior is in a hot loop. Maybe we shouldn't move it then, due to impacting the I-cache? Should we look at block weights to decide?
  • It seems like there should be an "alignment planning" pass that decides where to put the alignment instructions, and annotates the BasicBlocks with those decisions. It could both set a flag that an alignment instruction is needed, and include a pointer to the BasicBlock for the loop we are aligning. Then, the codegen loop would just act on those decisions, and be very simple. Interleaving the planning and codegen loops seems complicated. It seems like this might remove the need for the alignBBLists list.
  • Is there an end-to-end design written down in comments in one place? It seems like there should be.

Resolved review threads: src/coreclr/jit/block.h (×2), src/coreclr/jit/compiler.h
Comment on lines 4543 to 4566
#if FEATURE_LOOP_ALIGN
    if (calculateAlign)
    {
        // Track the blocks that need alignment, except the first block,
        // because adding padding in the prolog is not currently supported.
        if (opts.compJitHideAlignBehindJmp && block->isLoopAlign() && (block != fgFirstBB))
        {
            alignBlocksList* curr = new (this, CMK_FlowList) alignBlocksList(block);

            if (alignBBLists == nullptr)
            {
                alignBBLists = curr;
            }
            else
            {
                alignBB->next = curr;
            }

            alignBB = curr;
        }
    }
#endif

Member


It seems very strange to put this code in this function, since it has nothing to do with ref counts. Why is it here?

Member Author


I didn't want to do another iteration over the BasicBlocks because, for long methods, it can be costly. With that said, I have created a separate phase for align placement where I now iterate over the BasicBlocks. However, there is still room for improvement. In my latest code (which I will push shortly), bbPlaceLoopAlignInstructions(), I want to skip the pass if there are no loops to align. It turns out that the only reliable way to check that after lowering is to check BB->isLoopAlign(). I would really like to piggy-back on this ref counting method because it is the last thing executed before codegen, the state of BBF_LOOP_ALIGN is accurate at that point, and we would save iterating over the BasicBlocks again.

A simple line in this method, like the one below, avoids iterating over the basic block list in scenarios where we add loop alignment initially but then remove it during flow graph analysis (loops with calls, loop unrolling, compacting, etc.). Currently, I have added a needsLoopAlignment flag that is set whenever we mark a loop with BBF_LOOP_ALIGN but is never unset if we later unmark a loop.

needsLoopAlignment |= block->isLoopAlign();

Member


Could you keep a count of aligned loops and decrement the count when you remove a BBF_LOOP_ALIGN bit, then just check for "loopAlignCount > 0" before running the PlaceLoopAlignment phase?

It's really unpleasant to have unrelated phases tied together in a somewhat implicit contract.

Member Author


Could you keep a count of aligned loops and decrement the count when you remove a BBF_LOOP_ALIGN bit, then just check for "loopAlignCount > 0" before running the PlaceLoopAlignment phase?

Yes, that's exactly what I tried doing, but it turns out that sometimes a block/loop is marked as not needing alignment multiple times, especially in AddContainsCallAllContainingLoops(), which miscalculates the count.

Member


in AddContainsCallAllContainingLoops(), which miscalculates the count.

Why would it do that? Once you've cleared the bit, you wouldn't decrement again. e.g.,

void Compiler::AddContainsCallAllContainingLoops(unsigned lnum)
{
#if FEATURE_LOOP_ALIGN
    // If this is the innermost loop, reset the LOOP_ALIGN flag
    // because a loop containing a call is unlikely to benefit
    // from alignment.
    if (optLoopTable[lnum].lpChild == BasicBlock::NOT_IN_LOOP)
    {
        BasicBlock* first = optLoopTable[lnum].lpFirst;
        if (first->isLoopAlign())
        {
            assert(compAlignedLoopCount > 0);
            --compAlignedLoopCount;
            first->bbFlags &= ~BBF_LOOP_ALIGN;
            JITDUMP("Removing LOOP_ALIGN flag for " FMT_LP " that starts at " FMT_BB " because loop has a call.\n",
                    lnum, first->bbNum);
        }
    }
#endif
...

Resolved review thread: src/coreclr/jit/jitconfigvalues.h
#ifdef FEATURE_LOOP_ALIGN
/* Save the prev IG */

emitPrevIG = emitCurIG;
Member


Is it possible to implement this goal without creating an emitPrevIG "global"?

Resolved review threads: src/coreclr/jit/emitarm64.cpp (×2), src/coreclr/jit/emitxarch.cpp
Comment on lines 9879 to 10089
//if (emitComp->opts.disAsm)
//{
// emitDispInsAddr(dstRW);

// emitDispInsOffs(0, false);

// printf(" %-9s ; stress-mode injected interrupt\n", "int3");
//}
Member


Remove comment? Or uncomment?

Member Author


I intentionally left it commented out so that during debugging we can uncomment it and see the instruction. Let me know if you think otherwise.

@@ -217,6 +217,10 @@ class CodeGen final : public CodeGenInterface

void genInitializeRegisterState();

#if FEATURE_LOOP_ALIGN
void genMarkBlocksForLoopAlignment();
Member Author


Need to delete this.

@kunalspathak
Member Author

@BruceForstall - This should be ready for review. Here are the changes:

  • Added a bbPlaceLoopAlignInstructions() pass to label the blocks that should contain the align instruction. It prefers cold blocks when there are multiple candidates (see the sketch after this list). It also tracks loopAlignCandidates and skips the phase if loopAlignCandidates == 0. The code is much simpler now.
  • Removed the need for the prevIG I was tracking for a few edge scenarios; instead, we force a new IG before the loop head.
  • Created loopHeadIG(), which retrieves the loop head IG we are interested in most often. I still need to track the previous IG (idaTargetIG has been renamed to idaLoopHeadPredIG) so I know after what point the VEX encoding optimization can be re-enabled; if it is re-enabled after the loop head IG instead, I noticed that the deoptimized version of VEX encoding is generated for instructions inside the loop.
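
A sketch of that cold-block preference (simplified: an invented Block type stands in for BasicBlock and its weight and jump-kind queries):

struct Block
{
    Block*   next;
    bool     endsWithUncondJmp;
    unsigned weight; // block weight; lower means colder
};

// Among the blocks before the loop head that end in an unconditional jmp,
// pick the coldest one to host the align instruction; if there is none,
// fall back to padding just before the loop head.
Block* chooseAlignPlacement(Block* start, Block* loopHead)
{
    Block* best = nullptr;
    for (Block* b = start; b != loopHead; b = b->next)
    {
        if (b->endsWithUncondJmp && (best == nullptr || b->weight < best->weight))
        {
            best = b;
        }
    }
    return (best != nullptr) ? best : loopHead;
}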

@kunalspathak
Member Author

kunalspathak commented Nov 11, 2021

CodeSize diffs: https://dev.azure.com/dnceng/public/_build/results?buildId=1464807&view=ms.vss-build-web.run-extensions-tab
PerfScore diffs:

| Name | Diff | Methods regressed | Methods improved | Methods unchanged |
|---|---|---|---|---|
| benchmarks.run.windows.x64.checked | -6.69 | 28 | 90 | 100 |
| libraries.pmi.windows.x64.checked | -142.01 | 59 | 242 | 319 |
| aspnet.run.windows.x64.checked | -333.71 | 10 | 39 | 28 |
| coreclr_tests.pmi.windows.x64.checked | -34.29 | 52 | 64 | 67 |
| benchmarks.run.windows.arm64.checked | -2052.67 | 1 | 38 | 83 |
| libraries.pmi.windows.arm64.checked | -714.32 | 13 | 104 | 307 |
| coreclr_tests.pmi.windows.arm64.checked | 84.04 | 153 | 28 | 38 |

I analyzed some of the PerfScore regressions on windows/arm64 for coreclr_tests, and they all come from moving the align instructions behind jmps in blocks that are expensive. But again, since the align instructions won't be executed there, that is fine. Another option I considered is to keep placing the align in the block just before the loop when that block is colder than the preceding blocks that end with a jmp; but that might incur the fetch and decode cost of those instructions, since they sit right before the loop body. Hence, when I see multiple blocks that end with a jmp, I select the coldest among them and don't check the weight of the block that precedes the loop body.

@BruceForstall
Member

I analyzed some of the PerfScore regressions on windows/arm64 for coreclr_tests, and they all come from moving the align instructions behind jmps in blocks that are expensive.

Should PerfScore not count align instructions in an IG following an unconditional branch?

Member

@BruceForstall BruceForstall left a comment


A few questions

Review threads: src/coreclr/jit/compiler.cpp (×6), src/coreclr/jit/compiler.h, src/coreclr/jit/emit.cpp (×2)
Comment on lines +9987 to +10089
//
// if (emitComp->opts.disAsm)
//{
// emitDispInsAddr(dstRW);

// emitDispInsOffs(0, false);

// printf(" %-9s ; stress-mode injected interrupt\n", "int3");
//}
Member


Remove?

Member Author


I left it commented out intentionally so we can quickly uncomment it and check the disassembly. If you'd still prefer, I can delete it.

@kunalspathak
Member Author

Should PerfScore not count align instructions in an IG following an unconditional branch?

Let me see if it can be done easily.

@kunalspathak
Member Author

Let me see if it can be done easily.

Thanks for the suggestion. This gives the real data about PerfScore, which is much nicer.

| Name | PerfScore diff | Methods Regressed | Methods Improved | Methods NoChange |
|---|---|---|---|---|
| benchmarks.run.windows.x64 | -145.12 | 18 | 184 | 16 |
| libraries.pmi.windows.x64 | -270.95 | 31 | 471 | 118 |
| aspnet.run.windows.x64 | -440.38 | 6 | 64 | 7 |
| coreclr_tests.pmi.windows.x64 | -94.79 | 12 | 157 | 14 |
| benchmarks.run.windows.arm64 | -2273.58 | 0 | 97 | 25 |
| libraries.pmi.windows.arm64 | -1227.70 | 0 | 346 | 78 |
| coreclr_tests.pmi.windows.arm64 | -338.33 | 0 | 202 | 17 |
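
The scoring rule behind these numbers can be sketched as follows (a hypothetical helper, not the actual PerfScore code):

// Padding that sits after an unconditional branch is never reached on the
// fall-through path, so it contributes nothing; padding that can execute
// is charged per byte for fetch/decode.
double alignPerfScore(bool followsUncondBranch, unsigned paddingBytes, double perByteCost)
{
    if (followsUncondBranch)
    {
        return 0.0; // hidden behind a jmp: never executed
    }
    return paddingBytes * perByteCost;
}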

Commits:
  • fix the alignBytesRemoved
  • Some fixes and working model
  • Some fixes and redesign
  • Some more fixes
  • more fixes
  • fix
  • Add the check for fgFirstBB
  • misc changes
  • code cleanup + JitHideAlignBehindJmp switch
  • validatePadding only if align are before the loop IG
  • More cleanup, remove commented code
  • jit format
  • …st the targetIG to prevIG
  • Add IGF_REMOVED_ALIGN flag for special scenarios
@kunalspathak
Member Author

@BruceForstall - Can you review it again? I think I have addressed all the feedback.

Member

@BruceForstall BruceForstall left a comment


LGTM. One possible follow-up.

Review threads: src/coreclr/jit/block.cpp, src/coreclr/jit/compiler.cpp
Co-authored-by: Bruce Forstall <brucefo@microsoft.com>