JIT: PGO-based block reordering interferes with loop recognition #67318

AndyAyersMS · 2022-03-30T00:58:33Z

We currently rearrange the flow graph based on profile weights before we perform loop recognition. Since this recognition is lexical, rearrangement can impair loop recognition.

Here's a simple example:

;; COMPlus_TieredPGO=1
;; COMPlus_TC_QuickJitForLoops=0
;; (optionally use release runtime, as checked runtimes now change tiering parameters to cause tiering up sooner)

using System;
using System.Runtime.CompilerServices;
using System.Threading;

interface I
{
    public int F(int x, int y);
}

class Add : I 
{
    int I.F(int x, int y) => x + y;
}

class Mul : I
{
    int I.F(int x, int y) => x * y;
}

class X
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int MapF(I m, int[] xs, int[] ys, int from, int to)
    {
        int r = 0;
        for (int i = from; i < to; i++)
        {
            r += m.F(xs[i], ys[i]);
        }
        return r;
    }

    public static int Main()
    {
        int[] xs = new int[] { 1, 2, 3, 4 };
        int[] ys = new int[] { 4, 3, 2, 1 };
        I m = new Add();
        I mm = new Mul();

        int r = 0;

        // comment these two out to create a heavily biased GDV
        r += MapF(mm, xs, ys, 0, 3);
        r -= MapF(mm, xs, ys, 0, 3);

        for (int i = 0; i < 30; i++)
        {
            r += MapF(m, xs, ys, 0, 3);
            Thread.Sleep(15);
        }

        Thread.Sleep(50);

        for (int i = 0; i < 70; i++)
        {
            r += MapF(m, xs, ys, 0, 3);
        }

        return r / 20;
    }
}

As is we produce the flow graph on the left; with the two MapF(mm, ...) calls commented out the graph on the right:

Because of this reordering, loop recognition arrives at very different results, even though the flow graphs are isomorphic:

;;;;;;;;;;;;;;;;; left example

*************** Starting PHASE Find loops
*************** In optFindLoops()
*************** In optMarkLoopHeads()
1 loop heads marked
*************** In optFindNaturalLoops()
New loop epoch 1
L00: setting LPFLG_ITER
Recorded loop L00, from BB02 to BB05 (Head=BB01, Entry=BB02, Exit=BB05) [over V06 (ADD 1) LT V04]

***************  Natural loop table
L00, from BB02 to BB05 (Head=BB01, Entry=BB02, Exit=BB05) [over V06 (ADD 1) LT V04]

*************** In optFindAndScaleGeneralLoopBlocks()

Marking a loop from BB02 to BB05
    BB02(wt=300); unchanged: has profile weight
    BB03(wt=150); unchanged: has profile weight
    BB04(wt=150); unchanged: has profile weight
    BB05(wt=300); unchanged: has profile weight
Found a total of 1 general loops.
Skip alignment for L00 that starts at BB02 weight=300.
*************** In fgDebugCheckLoopTable

*************** Finishing PHASE Find loops

;;;;;;;;;;;;;;;;; right example

*************** Starting PHASE Find loops
*************** In optFindLoops()
*************** In optMarkLoopHeads()
2 loop heads marked
*************** In optFindNaturalLoops()
*************** In optFindAndScaleGeneralLoopBlocks()

Marking a loop from BB02 to BB04
    BB02(wt=300); unchanged: has profile weight
    BB03(wt=300); unchanged: has profile weight
    BB04(wt=300); unchanged: has profile weight
Marking a loop from BB04 to BB06
    BB04(wt=300); unchanged: has profile weight
    BB05(wt=100); unchanged: has profile weight
    BB06(wt=0); unchanged: has profile weight
Found a total of 2 general loops.
*************** In fgDebugCheckLoopTable

*************** Finishing PHASE Find loops

We actually see a high volume of monomorphic call sites with PGO, meaning we will quite often be seeing profiles like the one at right where the GDV test is highly biased and the residual indirect call site is cold.

Seems like we should look into deferring block reordering until after optimization.

cc @BruceForstall @EgorBo

The text was updated successfully, but these errors were encountered:

ghost · 2022-03-30T00:58:39Z

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

We currently rearrange the flow graph based on profile weights before we perform loop recognition. Since this recognition is lexical, rearrangement can impair loop recognition.

Here's a simple example:

;; COMPlus_TieredPGO=1
;; COMPlus_TC_QuickJitForLoops=0
;; (optionally use release runtime, as checked runtimes now change tiering parameters to cause tiering up sooner)

using System;
using System.Runtime.CompilerServices;
using System.Threading;

interface I
{
    public int F(int x, int y);
}

class Add : I 
{
    int I.F(int x, int y) => x + y;
}

class Mul : I
{
    int I.F(int x, int y) => x * y;
}

class X
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int MapF(I m, int[] xs, int[] ys, int from, int to)
    {
        int r = 0;
        for (int i = from; i < to; i++)
        {
            r += m.F(xs[i], ys[i]);
        }
        return r;
    }

    public static int Main()
    {
        int[] xs = new int[] { 1, 2, 3, 4 };
        int[] ys = new int[] { 4, 3, 2, 1 };
        I m = new Add();
        I mm = new Mul();

        int r = 0;

        // comment these two out to create a heavily biased GDV
        r += MapF(mm, xs, ys, 0, 3);
        r -= MapF(mm, xs, ys, 0, 3);

        for (int i = 0; i < 30; i++)
        {
            r += MapF(m, xs, ys, 0, 3);
            Thread.Sleep(15);
        }

        Thread.Sleep(50);

        for (int i = 0; i < 70; i++)
        {
            r += MapF(m, xs, ys, 0, 3);
        }

        return r / 20;
    }
}

As is we produce the flow graph on the left; with the two MapF(mm, ...) calls commented out the graph on the right:

Because of this reordering, loop recognition arrives at very different results, even though the flow graphs are isomorphic:

;;;;;;;;;;;;;;;;; left example

*************** Starting PHASE Find loops
*************** In optFindLoops()
*************** In optMarkLoopHeads()
1 loop heads marked
*************** In optFindNaturalLoops()
New loop epoch 1
L00: setting LPFLG_ITER
Recorded loop L00, from BB02 to BB05 (Head=BB01, Entry=BB02, Exit=BB05) [over V06 (ADD 1) LT V04]

***************  Natural loop table
L00, from BB02 to BB05 (Head=BB01, Entry=BB02, Exit=BB05) [over V06 (ADD 1) LT V04]

*************** In optFindAndScaleGeneralLoopBlocks()

Marking a loop from BB02 to BB05
    BB02(wt=300); unchanged: has profile weight
    BB03(wt=150); unchanged: has profile weight
    BB04(wt=150); unchanged: has profile weight
    BB05(wt=300); unchanged: has profile weight
Found a total of 1 general loops.
Skip alignment for L00 that starts at BB02 weight=300.
*************** In fgDebugCheckLoopTable

*************** Finishing PHASE Find loops

;;;;;;;;;;;;;;;;; right example

*************** Starting PHASE Find loops
*************** In optFindLoops()
*************** In optMarkLoopHeads()
2 loop heads marked
*************** In optFindNaturalLoops()
*************** In optFindAndScaleGeneralLoopBlocks()

Marking a loop from BB02 to BB04
    BB02(wt=300); unchanged: has profile weight
    BB03(wt=300); unchanged: has profile weight
    BB04(wt=300); unchanged: has profile weight
Marking a loop from BB04 to BB06
    BB04(wt=300); unchanged: has profile weight
    BB05(wt=100); unchanged: has profile weight
    BB06(wt=0); unchanged: has profile weight
Found a total of 2 general loops.
*************** In fgDebugCheckLoopTable

*************** Finishing PHASE Find loops

We actually see a high volume of monomorphic call sites with PGO, meaning we will quite often be seeing profiles like the one at right where the GDV test is highly biased and the residual indirect call site is cold.

Seems like we should look into deferring block reordering until after optimization.

cc @BruceForstall @EgorBo

Author:	AndyAyersMS
Assignees:	-
Labels:	`area-CodeGen-coreclr`, `untriaged`
Milestone:	-

AndyAyersMS · 2022-03-31T22:08:04Z

In principle all we have to do here is move optOptimizeLayout later in the phase ordering. Experimenting with this I am seeing various asserts as later phases either have implicit or (quasi) explicit dependences on derived graph info, like

implicitly assuming that block numbers represent sequential block order
dominators have been computed

So, it's going to take a bit of work to tease apart what is an essential constrant or to make the implicit more explicit. Initial results from a CQ/Code Size standpoint look very promising, so this is worth pushing on.

I am wondering if we can get out of the strong dependence on bbNum by relating some of the derived data to bbId instead, so that something like renumbering blocks does not invalidate dominators. Possibly more radical than it sounds.

AndyAyersMS · 2022-03-31T22:10:17Z

Also, some subset of #66998 is likely explained by the above.

AndyAyersMS · 2022-04-29T19:01:26Z

Note we can't (currently) update loop recognition to handle non-contiguous loop bodies because it uses the block extent in various ways for correctness, eg

runtime/src/coreclr/jit/optimizer.cpp

Lines 5596 to 5635 in daec9dc

    
           bool Compiler::optIsVarAssigned(BasicBlock* beg, BasicBlock* end, GenTree* skip, unsigned var) 
        
           { 
        
               bool         result; 
        
               isVarAssgDsc desc; 
        
               desc.ivaSkip = skip; 
        
           #ifdef DEBUG 
        
               desc.ivaSelf = &desc; 
        
           #endif 
        
               desc.ivaVar      = var; 
        
               desc.ivaMaskCall = CALLINT_NONE; 
        
               AllVarSetOps::AssignNoCopy(this, desc.ivaMaskVal, AllVarSetOps::MakeEmpty(this)); 
        
               for (;;) 
        
               { 
        
                   noway_assert(beg != nullptr); 
        
                   for (Statement* const stmt : beg->Statements()) 
        
                   { 
        
                       if (fgWalkTreePre(stmt->GetRootNodePointer(), optIsVarAssgCB, &desc) != WALK_CONTINUE) 
        
                       { 
        
                           result = true; 
        
                           goto DONE; 
        
                       } 
        
                   } 
        
                   if (beg == end) 
        
                   { 
        
                       break; 
        
                   } 
        
                   beg = beg->bbNext; 
        
               } 
        
               result = false; 
        
           DONE: 
        
               return result; 
        
           }

AndyAyersMS · 2022-05-02T18:21:44Z

Also see #59087 (comment).

I am going to retitle this issue since it's more broadly about block reordering because of PGO interfering with loop optimization, and not restricted to GDV.

The JIT currently will aggressively reorder the flow graph before running its loop recognition phases. When there is PGO data this sometimes perturbs the block order so that loops are no longer recognized, and we miss out on some loop optimizations. This change defers most block reordering until after the JIT has gone through the optimization phases. There is still a limited form of flow cleanup done early on. There is also a compensating change in loop recognition in one place where it was relying on adjacent blocks being combined. Fixes dotnet#67318.

…69878) The JIT currently will aggressively reorder the flow graph before running its loop recognition phases. When there is PGO data this sometimes perturbs the block order so that loops are no longer recognized, and we miss out on some loop optimizations. This change defers most block reordering until after the JIT has gone through the optimization phases. There is still a limited form of flow cleanup done early on. There is also a compensating change in loop recognition in one place where it was relying on adjacent blocks being combined. Fixes #67318.

dotnet-issue-labeler bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI untriaged New issue has not been triaged by the area owner labels Mar 30, 2022

AndyAyersMS mentioned this issue Mar 30, 2022

Loop cloning driven by type tests #65206

Open

BruceForstall mentioned this issue Mar 31, 2022

Improve JIT loop optimizations #65342

Open

20 tasks

JulieLeeMSFT added this to the Future milestone Apr 4, 2022

JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label Apr 4, 2022

This was referenced Apr 25, 2022

Regressions in System.Text.Json.Document.Tests.Perf_EnumerateArray #67176

Closed

Add support for delegate GDV and method-based vtable GDV #68703

Merged

AndyAyersMS changed the title ~~Poor interaction between loop recognition and GDV caused by early block reordering~~ JIT: PGO-based block reordering interferes with loop recognition May 2, 2022

AndyAyersMS modified the milestones: Future, 7.0.0 May 2, 2022

AndyAyersMS self-assigned this May 2, 2022

AndyAyersMS mentioned this issue May 26, 2022

JIT: defer some flow graph reordering until after loop recognition #69878

Merged

ghost added the in-pr There is an active PR which will close this issue when it is merged label May 26, 2022

AndyAyersMS closed this as completed in #69878 May 30, 2022

ghost removed the in-pr There is an active PR which will close this issue when it is merged label May 30, 2022

ghost locked as resolved and limited conversation to collaborators Jun 29, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

JIT: PGO-based block reordering interferes with loop recognition #67318

JIT: PGO-based block reordering interferes with loop recognition #67318

AndyAyersMS commented Mar 30, 2022

ghost commented Mar 30, 2022

AndyAyersMS commented Mar 31, 2022

AndyAyersMS commented Mar 31, 2022

AndyAyersMS commented Apr 29, 2022

AndyAyersMS commented May 2, 2022

JIT: PGO-based block reordering interferes with loop recognition #67318

JIT: PGO-based block reordering interferes with loop recognition #67318

Comments

AndyAyersMS commented Mar 30, 2022

ghost commented Mar 30, 2022

AndyAyersMS commented Mar 31, 2022

AndyAyersMS commented Mar 31, 2022

AndyAyersMS commented Apr 29, 2022

AndyAyersMS commented May 2, 2022