Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Loop cloning driven by type tests #65206

Open
Tracked by #65342
BruceForstall opened this issue Feb 11, 2022 · 7 comments
Open
Tracked by #65342

Loop cloning driven by type tests #65206

BruceForstall opened this issue Feb 11, 2022 · 7 comments
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Milestone

Comments

@BruceForstall
Copy link
Member

BruceForstall commented Feb 11, 2022

JIT loop cloning creates a "fast path" and a "slow path", where a set of checks is done to choose between them. Currently, those checks determine whether array bounds checks can be statically removed, in which case the fast path does remove some number of bounds checks.

This issue proposes generalizing the cloning conditions and "fast path" optimizations to include type tests.

Guarded devirtualization (GDV) will introduce type tests. If we see one or more of those in a loop and the tests are loop invariant and biased, we should consider adding those as cloning preconditions.

The canonical example is a foreach over an IEnumerable<T>. This will create an iterator object and make repeated interface calls to MoveNext, get_Current, etc.

When there is a dominant collection type (e.g., List<int>), with GDV we will have multiple checks for the type of the enumerator that are highly biased towards the collections’ enumerator type. This enumerator object’s type is a loop invariant.

So a plausible implementation sketch is:

  • Walk loop body looking for loop invariant type tests feeding branches that are highly biased (one successor cold). Collect up all the unique instances.
  • If there are a sufficient number and other things look good, clone the loop.
  • Add those tests to the predicate chain for the hot loop. This may require a null check followed by a type test for each.

Initially we can likely rely on redundant branch optimizations to clean up the now-redundant tests in the fast path and slow path loops, but we could also try cleaning those up during cloning.

A canonical test case would be a no-inline method taking an IEnumerable<int>, say, adding up the elements via foreach and returning the sum.

If we feed that with List<int> we should see dramatically improved performance.
If we feed that with int[] we should also see improved performance, assuming #62457 is resolved.

Baseline would be the performance of the same code where the arg has the concrete collection type and/or a version that iterates manually via for.

Extra credit would be doing similar things if the operation is passed in via Func<T> or similar… (devirtualize and also profile driven delegate opt)


Basics are now implemented via following PRs.

Areas for follow-up:

  • better loop recognition. Note we don't need counted/iterable loops for type test cloning, we just need to know that there is a single entry
  • better invariance analysis. Right now we do a simple IR walk that looks for local assignments.
  • better profile maintenance.
  • optimize the GDV predicates in the fast path loop directly
  • "pessimize" the GDV predicate in the slow path loop directly (always execute the fall back)

category:cq
theme:loop-opt
skill-level:expert
cost:large
impact:large

@BruceForstall BruceForstall added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 11, 2022
@BruceForstall BruceForstall added this to the 7.0.0 milestone Feb 11, 2022
@ghost
Copy link

ghost commented Feb 11, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

JIT loop cloning creates a "fast path" and a "slow path", where a set of checks is done to choose between them. Currently, those checks determine whether array bounds checks can be statically removed, in which case the fast path does remove some number of bounds checks.

This issue proposes generalizing the cloning conditions and "fast path" optimizations to include type tests.

Guarded devirtualization (GDV) will introduce type tests. If we see one or more of those in a loop and the tests are loop invariant and biased, we should consider adding those as cloning preconditions.

The canonical example is a foreach over an IEnumerable<T>. This will create an iterator object and make repeated interface calls to MoveNext, get_Current, etc.

When there is a dominant collection type (e.g., List<int>), with GDV we will have multiple checks for the type of the enumerator that are highly biased towards the collections’ enumerator type. This enumerator object’s type is a loop invariant.

So a plausible implementation sketch is:

  • Walk loop body looking for loop invariant type tests feeding branches that are highly biased (one successor cold). Collect up all the unique instances.
  • If there are a sufficient number and other things look good, clone the loop.
  • Add those tests to the predicate chain for the hot loop. This may require a null check followed by a type test for each.

Initially we can likely rely on redundant branch optimizations to clean up the now-redundant tests in the fast path and slow path loops, but we could also try cleaning those up during cloning.

A canonical test case would be a no-inline method taking an IEnumerable<int>, say, adding up the elements via foreach and returning the sum.

If we feed that with List<int> we should see dramatically improved performance.
If we feed that with int[] we should also see improved performance, assuming #62457 is resolved.

Baseline would be the performance of the same code where the arg has the concrete collection type and/or a version that iterates manually via for.

Extra credit would be doing similar things if the operation is passed in via Func<T> or similar… (devirtualize and also profile driven delegate opt)

Author: BruceForstall
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: 7.0.0

@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Feb 11, 2022
@BruceForstall BruceForstall removed the untriaged New issue has not been triaged by the area owner label Feb 11, 2022
@AndyAyersMS
Copy link
Member

A simple prototype (that works on at least one example) can be found here: main...AndyAyersMS:CloneForTypeTestNew.

It does not yet directly optimize the fast path loop, but the dominating type test that gets introduced will allow the redundant branch optimizer to (in many case) eliminate the in-branch loop.

    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int MapF(I m, int[] xs, int[] ys, int from, int to)
    {
        int r = 0;
        for (int i = from; i < to; i++)
        {
            r += m.F(xs[i], ys[i]);
        }
        return r;
    }
G_M19203_IG02:              ;; offset=001CH
       4533F6               xor      r14d, r14d
       458BF9               mov      r15d, r9d
       443BFB               cmp      r15d, ebx
       0F8DB2000000         jge      G_M19203_IG09
       4885F6               test     rsi, rsi
       7444                 je       SHORT G_M19203_IG05
       4885FF               test     rdi, rdi
       743F                 je       SHORT G_M19203_IG05
       4585FF               test     r15d, r15d
       7C3A                 jl       SHORT G_M19203_IG05
       85DB                 test     ebx, ebx
       7C36                 jl       SHORT G_M19203_IG05
       48BAC83E1A98FA7F0000 mov      rdx, 0x7FFA981A3EC8      ; Add
       48395500             cmp      qword ptr [rbp], rdx ;; ----------------------------  "hoisted" type test
       7526                 jne      SHORT G_M19203_IG05
       395E08               cmp      dword ptr [rsi+8], ebx
       7C21                 jl       SHORT G_M19203_IG05
       395F08               cmp      dword ptr [rdi+8], ebx
       7C1C                 jl       SHORT G_M19203_IG05
						;; bbWeight=1    PerfScore 19.00
G_M19203_IG03:              ;; offset=0058H   ---------------------------- fast loop (no calls, no bounds checks)
       418BD7               mov      edx, r15d
       448B449610           mov      r8d, dword ptr [rsi+4*rdx+16]
       8B549710             mov      edx, dword ptr [rdi+4*rdx+16]
       4403C2               add      r8d, edx
       4503F0               add      r14d, r8d
       41FFC7               inc      r15d
       443BFB               cmp      r15d, ebx
       7CE6                 jl       SHORT G_M19203_IG03
						;; bbWeight=2.97 PerfScore 18.56
G_M19203_IG04:              ;; offset=0072H
       EB69                 jmp      SHORT G_M19203_IG09
						;; bbWeight=1    PerfScore 2.00
G_M19203_IG05:              ;; offset=0074H  ---------------------------- slow loop (retains GDV, bounds checks)
       48BAC83E1A98FA7F0000 mov      rdx, 0x7FFA981A3EC8      ; Add
       48395500             cmp      qword ptr [rbp], rdx
       7520                 jne      SHORT G_M19203_IG07
						;; bbWeight=0.03 PerfScore 0.13
G_M19203_IG06:              ;; offset=0084H
       443B7E08             cmp      r15d, dword ptr [rsi+8]
       7363                 jae      SHORT G_M19203_IG11
       458BC7               mov      r8d, r15d
       468B448610           mov      r8d, dword ptr [rsi+4*r8+16]
       443B7F08             cmp      r15d, dword ptr [rdi+8]
       7355                 jae      SHORT G_M19203_IG11
       418BD7               mov      edx, r15d
       8B549710             mov      edx, dword ptr [rdi+4*rdx+16]
       4403C2               add      r8d, edx
       EB2E                 jmp      SHORT G_M19203_IG08
						;; bbWeight=0.02 PerfScore 0.22
G_M19203_IG07:              ;; offset=00A4H
       443B7E08             cmp      r15d, dword ptr [rsi+8]
       7343                 jae      SHORT G_M19203_IG11
       418BD7               mov      edx, r15d
       8B549610             mov      edx, dword ptr [rsi+4*rdx+16]
       443B7F08             cmp      r15d, dword ptr [rdi+8]
       7336                 jae      SHORT G_M19203_IG11
       458BC7               mov      r8d, r15d
       468B448710           mov      r8d, dword ptr [rdi+4*r8+16]
       488BCD               mov      rcx, rbp
       49BB1000BB97FA7F0000 mov      r11, 0x7FFA97BB0010
       41FF13               call     [r11]I:F(int,int):int:this
       448BC0               mov      r8d, eax
						;; bbWeight=0.02 PerfScore 0.24
G_M19203_IG08:              ;; offset=00D2H
       4503F0               add      r14d, r8d
       41FFC7               inc      r15d
       443BFB               cmp      r15d, ebx
       7C97                 jl       SHORT G_M19203_IG05

Some issues that have come up during prototyping:

  • The block arrangements we do before PGO can interfere with loop recognition. See JIT: PGO-based block reordering interferes with loop recognition #67318.
  • Because these loops will inevitably involve calls, we'll decide initially that they are not worth aligning. But later if we clone then all calls in the fast path loop may be optimized away. So, we should reconsider our strategy for marking loops as alignment-worthy.
  • The types of expressions that we can detect as loop invariant are quite limited (currently just lcl vars and constants). This may or may not prove to be a serious limitation. If we want to consider more complex expressions then we also will have similar complications as are seen in hosting.
  • Likewise the types of loop structures we recognize (even without complications from block reordering) are limited, eg the following sort of loop should in principle be clonable if there is one dominant type but isn't recognized. And (see below) the natural loop recognition we do seems to get confused and sees 4 loops here when there is only 1.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int Sum(IEnumerable<int> e)
    {
        int r = 0;
        foreach(int i in e)
        {
            r += i;
        }
        return r;
    }

image

*************** In optFindLoops()
*************** In optMarkLoopHeads()
4 loop heads marked
*************** In optFindNaturalLoops()
*************** In optFindAndScaleGeneralLoopBlocks()

Marking a loop from BB05 to BB13
    BB05(wt=600); unchanged: has profile weight
    BB06(wt=600); unchanged: has profile weight
    BB07(wt=600); unchanged: has profile weight
    BB08(wt=700); unchanged: has profile weight
    BB09(wt=700); unchanged: has profile weight
    BB10(wt=700); unchanged: has profile weight
    BB11(wt=635.9581); unchanged: has profile weight
    BB12(wt=700); unchanged: has profile weight
    BB13(wt=700); unchanged: has profile weight
Marking a loop from BB07 to BB19
    BB07(wt=600); unchanged: has profile weight
    BB08(wt=700); unchanged: has profile weight
    BB09(wt=700); unchanged: has profile weight
    BB10(wt=700); unchanged: has profile weight
    BB11(wt=635.9581); unchanged: has profile weight
    BB12(wt=700); unchanged: has profile weight
    BB13(wt=700); unchanged: has profile weight
    BB14(wt=100); unchanged: has profile weight
    BB15(wt=64.04188); unchanged: has profile weight
    BB16(wt=64.04188); unchanged: has profile weight
    BB17(wt=0); unchanged: has profile weight
    BB18(wt=0); unchanged: has profile weight
    BB19(wt=0); unchanged: has profile weight
Marking a loop from BB12 to BB16
    BB12(wt=700); unchanged: has profile weight
    BB13(wt=700); unchanged: has profile weight
    BB14(wt=100); unchanged: has profile weight
    BB15(wt=64.04188); unchanged: has profile weight
    BB16(wt=64.04188); unchanged: has profile weight
Marking a loop from BB13 to BB18
    BB13(wt=700); unchanged: has profile weight
    BB14(wt=100); unchanged: has profile weight
    BB15(wt=64.04188); unchanged: has profile weight
    BB16(wt=64.04188); unchanged: has profile weight
    BB17(wt=0); unchanged: has profile weight
    BB18(wt=0); unchanged: has profile weight
Found a total of 4 general loops.
*************** In fgDebugCheckLoopTable

*************** Finishing PHASE Find loops

@AndyAyersMS
Copy link
Member

I suppose at some point we need to also take a closer look at cloniing profitability. Unlike bounds checks, a type test failure is not uncommon. So we ideally only want to clone for highly biased successful test (we're currently willing to introduce GDV even if the type test is likely to fail, as the likelihoodThreshold is 25 or 30 (out of 100).

likelihood = likelyClasses[0].likelihood;
likelyClass = likelyClasses[0].clsHandle;
if (numberOfClasses < 1)
{
JITDUMP("No likely class, sorry\n");
return;
}
assert(likelyClass != NO_CLASS_HANDLE);
// Print all likely classes
JITDUMP("%s classes for %p (%s):\n", doRandomDevirt ? "Random" : "Likely", dspPtr(objClass), objClassName)
for (UINT32 i = 0; i < numberOfClasses; i++)
{
JITDUMP(" %u) %p (%s) [likelihood:%u%%]\n", i + 1, likelyClasses[i].clsHandle,
eeGetClassName(likelyClasses[i].clsHandle), likelyClasses[i].likelihood);
}
// Todo: a more advanced heuristic using likelihood, number of
// classes, and the profile count for this block.
//
// For now we will guess if the likelihood is at least 25%/30% (intfc/virt), as studies
// have shown this transformation should pay off even if we guess wrong sometimes.
//
if (likelihood < likelihoodThreshold)
{
JITDUMP("Not guessing for class; likelihood is below %s call threshold %u\n", callKind, likelihoodThreshold);
return;
}

We'd need to recover this bias information from the edge weights, which are potentially unreliable by this point (the successor blocks of a GDV test should be join-free, so perhaps we can simply look at their weights).

A related issue is that we shouldn't be inspired to clone if the GDV test is unlikely to be executed, so we should look at the relative weight of the GDV test and the loop entry or exit.

And if we ever enable "multi-guess" GDV (eg #59604) we need to decide how to fit that into the scheme too. I don't think we want to make multiple clones if there are several likely GDV targets, but it is something we could consider.

@BruceForstall
Copy link
Member Author

  • Because these loops will inevitably involve calls, we'll decide initially that they are not worth aligning. But later if we clone then all calls in the fast path loop may be optimized away. So, we should reconsider our strategy for marking loops as alignment-worthy.

Some kind of deferred loop alignment computation, like was attempted in #62940, should be done.

@BruceForstall
Copy link
Member Author

Andy, since you're working on this, I'm assigning to you. You can decide when/whether to push to 8.0.0.

AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue Jun 7, 2022
Such as those added by GDV.

The JIT will now clone just for type tests, or just for array bounds, or for
a mixture of the two. The JIT will still produce just one fast and one slow loop.
If there are a mixture of array bounds and type test conditions, all conditions
must pass for control to reach the fast loop.

Unlike array bounds checks, type test failures are not intrinsically rare,
so there is some profitability screening to ensure that a failed type test does
not force execution to run the slow version "too often". The type test must
execute frequently within the loop, and be heavily biased towards success.

This is work towards resolving dotnet#65206.
AndyAyersMS added a commit that referenced this issue Jun 8, 2022
Such as those added by GDV.

The JIT will now clone just for type tests, or just for array bounds, or for
a mixture of the two. The JIT will still produce just one fast and one slow loop.
If there are a mixture of array bounds and type test conditions, all conditions
must pass for control to reach the fast loop.

Unlike array bounds checks, type test failures are not intrinsically rare,
so there is some profitability screening to ensure that a failed type test does
not force execution to run the slow version "too often". The type test must
execute frequently within the loop, and be heavily biased towards success.

This is work towards resolving #65206.
@AndyAyersMS
Copy link
Member

The first bits of this are implemented. It doesn't kick in as widely as I would expect -- likely there are several areas that need to be reworked to expose more opportunities:

  • better loop recognition. Note we don't need counted/iterable loops for type test cloning, we just need to know that there is a single entry
  • better invariance analysis. Right now we do a simple IR walk that looks for local assignments.
  • better profile maintenance.
  • etc

Going to consider these as opportunistic for .NET 7.

@AndyAyersMS AndyAyersMS modified the milestones: 7.0.0, 8.0.0 Jun 10, 2022
@AndyAyersMS AndyAyersMS modified the milestones: 8.0.0, Future Jun 18, 2022
@BruceForstall
Copy link
Member Author

Related: #75140

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

No branches or pull requests

2 participants