-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Loop cloning driven by type tests #65206
Comments
Tagging subscribers to this area: @JulieLeeMSFT Issue DetailsJIT loop cloning creates a "fast path" and a "slow path", where a set of checks is done to choose between them. Currently, those checks determine whether array bounds checks can be statically removed, in which case the fast path does remove some number of bounds checks. This issue proposes generalizing the cloning conditions and "fast path" optimizations to include type tests. Guarded devirtualization (GDV) will introduce type tests. If we see one or more of those in a loop and the tests are loop invariant and biased, we should consider adding those as cloning preconditions. The canonical example is a When there is a dominant collection type (e.g., So a plausible implementation sketch is:
Initially we can likely rely on redundant branch optimizations to clean up the now-redundant tests in the fast path and slow path loops, but we could also try cleaning those up during cloning. A canonical test case would be a no-inline method taking an If we feed that with Baseline would be the performance of the same code where the arg has the concrete collection type and/or a version that iterates manually via Extra credit would be doing similar things if the operation is passed in via
|
A simple prototype (that works on at least one example) can be found here: main...AndyAyersMS:CloneForTypeTestNew. It does not yet directly optimize the fast path loop, but the dominating type test that gets introduced will allow the redundant branch optimizer to (in many case) eliminate the in-branch loop. [MethodImpl(MethodImplOptions.NoInlining)]
public static int MapF(I m, int[] xs, int[] ys, int from, int to)
{
int r = 0;
for (int i = from; i < to; i++)
{
r += m.F(xs[i], ys[i]);
}
return r;
} G_M19203_IG02: ;; offset=001CH
4533F6 xor r14d, r14d
458BF9 mov r15d, r9d
443BFB cmp r15d, ebx
0F8DB2000000 jge G_M19203_IG09
4885F6 test rsi, rsi
7444 je SHORT G_M19203_IG05
4885FF test rdi, rdi
743F je SHORT G_M19203_IG05
4585FF test r15d, r15d
7C3A jl SHORT G_M19203_IG05
85DB test ebx, ebx
7C36 jl SHORT G_M19203_IG05
48BAC83E1A98FA7F0000 mov rdx, 0x7FFA981A3EC8 ; Add
48395500 cmp qword ptr [rbp], rdx ;; ---------------------------- "hoisted" type test
7526 jne SHORT G_M19203_IG05
395E08 cmp dword ptr [rsi+8], ebx
7C21 jl SHORT G_M19203_IG05
395F08 cmp dword ptr [rdi+8], ebx
7C1C jl SHORT G_M19203_IG05
;; bbWeight=1 PerfScore 19.00
G_M19203_IG03: ;; offset=0058H ---------------------------- fast loop (no calls, no bounds checks)
418BD7 mov edx, r15d
448B449610 mov r8d, dword ptr [rsi+4*rdx+16]
8B549710 mov edx, dword ptr [rdi+4*rdx+16]
4403C2 add r8d, edx
4503F0 add r14d, r8d
41FFC7 inc r15d
443BFB cmp r15d, ebx
7CE6 jl SHORT G_M19203_IG03
;; bbWeight=2.97 PerfScore 18.56
G_M19203_IG04: ;; offset=0072H
EB69 jmp SHORT G_M19203_IG09
;; bbWeight=1 PerfScore 2.00
G_M19203_IG05: ;; offset=0074H ---------------------------- slow loop (retains GDV, bounds checks)
48BAC83E1A98FA7F0000 mov rdx, 0x7FFA981A3EC8 ; Add
48395500 cmp qword ptr [rbp], rdx
7520 jne SHORT G_M19203_IG07
;; bbWeight=0.03 PerfScore 0.13
G_M19203_IG06: ;; offset=0084H
443B7E08 cmp r15d, dword ptr [rsi+8]
7363 jae SHORT G_M19203_IG11
458BC7 mov r8d, r15d
468B448610 mov r8d, dword ptr [rsi+4*r8+16]
443B7F08 cmp r15d, dword ptr [rdi+8]
7355 jae SHORT G_M19203_IG11
418BD7 mov edx, r15d
8B549710 mov edx, dword ptr [rdi+4*rdx+16]
4403C2 add r8d, edx
EB2E jmp SHORT G_M19203_IG08
;; bbWeight=0.02 PerfScore 0.22
G_M19203_IG07: ;; offset=00A4H
443B7E08 cmp r15d, dword ptr [rsi+8]
7343 jae SHORT G_M19203_IG11
418BD7 mov edx, r15d
8B549610 mov edx, dword ptr [rsi+4*rdx+16]
443B7F08 cmp r15d, dword ptr [rdi+8]
7336 jae SHORT G_M19203_IG11
458BC7 mov r8d, r15d
468B448710 mov r8d, dword ptr [rdi+4*r8+16]
488BCD mov rcx, rbp
49BB1000BB97FA7F0000 mov r11, 0x7FFA97BB0010
41FF13 call [r11]I:F(int,int):int:this
448BC0 mov r8d, eax
;; bbWeight=0.02 PerfScore 0.24
G_M19203_IG08: ;; offset=00D2H
4503F0 add r14d, r8d
41FFC7 inc r15d
443BFB cmp r15d, ebx
7C97 jl SHORT G_M19203_IG05 Some issues that have come up during prototyping:
[MethodImpl(MethodImplOptions.NoInlining)]
public static int Sum(IEnumerable<int> e)
{
int r = 0;
foreach(int i in e)
{
r += i;
}
return r;
}
|
I suppose at some point we need to also take a closer look at cloniing profitability. Unlike bounds checks, a type test failure is not uncommon. So we ideally only want to clone for highly biased successful test (we're currently willing to introduce GDV even if the type test is likely to fail, as the runtime/src/coreclr/jit/importer.cpp Lines 21853 to 21882 in a0d7212
We'd need to recover this bias information from the edge weights, which are potentially unreliable by this point (the successor blocks of a GDV test should be join-free, so perhaps we can simply look at their weights). A related issue is that we shouldn't be inspired to clone if the GDV test is unlikely to be executed, so we should look at the relative weight of the GDV test and the loop entry or exit. And if we ever enable "multi-guess" GDV (eg #59604) we need to decide how to fit that into the scheme too. I don't think we want to make multiple clones if there are several likely GDV targets, but it is something we could consider. |
Some kind of deferred loop alignment computation, like was attempted in #62940, should be done. |
Andy, since you're working on this, I'm assigning to you. You can decide when/whether to push to 8.0.0. |
Such as those added by GDV. The JIT will now clone just for type tests, or just for array bounds, or for a mixture of the two. The JIT will still produce just one fast and one slow loop. If there are a mixture of array bounds and type test conditions, all conditions must pass for control to reach the fast loop. Unlike array bounds checks, type test failures are not intrinsically rare, so there is some profitability screening to ensure that a failed type test does not force execution to run the slow version "too often". The type test must execute frequently within the loop, and be heavily biased towards success. This is work towards resolving dotnet#65206.
Such as those added by GDV. The JIT will now clone just for type tests, or just for array bounds, or for a mixture of the two. The JIT will still produce just one fast and one slow loop. If there are a mixture of array bounds and type test conditions, all conditions must pass for control to reach the fast loop. Unlike array bounds checks, type test failures are not intrinsically rare, so there is some profitability screening to ensure that a failed type test does not force execution to run the slow version "too often". The type test must execute frequently within the loop, and be heavily biased towards success. This is work towards resolving #65206.
The first bits of this are implemented. It doesn't kick in as widely as I would expect -- likely there are several areas that need to be reworked to expose more opportunities:
Going to consider these as opportunistic for .NET 7. |
Related: #75140 |
JIT loop cloning creates a "fast path" and a "slow path", where a set of checks is done to choose between them. Currently, those checks determine whether array bounds checks can be statically removed, in which case the fast path does remove some number of bounds checks.
This issue proposes generalizing the cloning conditions and "fast path" optimizations to include type tests.
Guarded devirtualization (GDV) will introduce type tests. If we see one or more of those in a loop and the tests are loop invariant and biased, we should consider adding those as cloning preconditions.
The canonical example is a
foreach
over anIEnumerable<T>
. This will create an iterator object and make repeated interface calls toMoveNext
,get_Current
, etc.When there is a dominant collection type (e.g.,
List<int>
), with GDV we will have multiple checks for the type of the enumerator that are highly biased towards the collections’ enumerator type. This enumerator object’s type is a loop invariant.So a plausible implementation sketch is:
Initially we can likely rely on redundant branch optimizations to clean up the now-redundant tests in the fast path and slow path loops, but we could also try cleaning those up during cloning.
A canonical test case would be a no-inline method taking an
IEnumerable<int>
, say, adding up the elements viaforeach
and returning the sum.If we feed that with
List<int>
we should see dramatically improved performance.If we feed that with
int[]
we should also see improved performance, assuming #62457 is resolved.Baseline would be the performance of the same code where the arg has the concrete collection type and/or a version that iterates manually via
for
.Extra credit would be doing similar things if the operation is passed in via
Func<T>
or similar… (devirtualize and also profile driven delegate opt)Basics are now implemented via following PRs.
Areas for follow-up:
category:cq
theme:loop-opt
skill-level:expert
cost:large
impact:large
The text was updated successfully, but these errors were encountered: