# Improve JIT loop optimizations (.NET 6) #43549
fyi @dotnet/jit-contrib

---
Some candidates for the "Missing optimizations" section:

```csharp
static void Foo3(byte* array, int len, byte val)
{
    for (int i = 0; i < len; i++)
        array[i] = val;
}
```

is folded into:

```llvm
call void @llvm.memset.p0i8.i64(i8* %array, i8 %val, i64 %len, i32 1, i1 false)
```
Also, really enjoyed this one:

```c
bool is_sorted(int *a, int n) {
    for (int i = 0; i < n - 1; i++)
        if (a[i] > a[i + 1])
            return false;
    return true;
}
```

---
An induction analysis case from this thread https://news.ycombinator.com/item?id=13182726:

```c
int X(int num) {
    int a = 0;
    for (int x = 0; x < num; x += 2) {
        if (x % 2 != 0) {
            a += x;
        }
    }
    return a;
}
```

where gcc and clang with -O2 produce:

```asm
X(int):    # @X(int)
    xor eax, eax
    ret
```

The key analysis going on here is called scalar evolution (SCEV) in the LLVM community. It is basically just a stronger symbolic induction variable analysis than what is normally needed for traditional loop optimizations. In this case SCEV identifies that `x` starts at 0, increments to `num` by 2, and is always even, so the `x % 2 != 0` branch is never taken and `a` remains 0. It also knows that `a` is a reduction of `x`. The calculus to simplify this is actually straightforward based on that symbolic information.

---
Reminds me of an issue we had in

---
Just a thought - it would also be nice to add to the strength reduction category the ability for the JIT to move the base indexing calculation outside of the loop body in cases where the range is just over interior references on the same object, e.g. when doing a `foreach` over a `Span<T>`. For reference (just a random example purely to show the loop codegen difference):

Foreach loop:

```csharp
static int Sum1(Span<int> span)
{
    int sum = 0;
    foreach (int n in span)
    {
        sum += n;
    }
    return sum;
}
```

x64 codegen:

```asm
L0000: xor eax, eax
L0002: mov rdx, [rcx]
L0005: mov ecx, [rcx+8]
L0008: xor r8d, r8d
L000b: test ecx, ecx
L000d: jle short L0021
L000f: movsxd r9, r8d
L0012: mov r9d, [rdx+r9*4]
L0016: add eax, r9d
L0019: inc r8d
L001c: cmp r8d, ecx
L001f: jl short L000f
L0021: ret
```

Manually optimized loop:

```csharp
static int Sum2(Span<int> span)
{
    ref int rStart = ref MemoryMarshal.GetReference(span);
    ref int rEnd = ref Unsafe.Add(ref rStart, span.Length);
    int sum = 0;
    while (Unsafe.IsAddressLessThan(ref rStart, ref rEnd))
    {
        sum += rStart;
        rStart = ref Unsafe.Add(ref rStart, 1);
    }
    return sum;
}
```

x64 codegen:

```asm
L0000: mov rax, [rcx]
L0003: mov edx, [rcx+8]
L0006: movsxd rdx, edx
L0009: lea rdx, [rax+rdx*4]
L000d: xor ecx, ecx
L000f: cmp rax, rdx
L0012: jae short L001f
L0014: add ecx, [rax]
L0016: add rax, 4
L001a: cmp rax, rdx
L001d: jb short L0014
L001f: mov eax, ecx
L0021: ret
```

The loop body in this example goes down from:

```asm
L000f: movsxd r9, r8d
L0012: mov r9d, [rdx+r9*4]
L0016: add eax, r9d
L0019: inc r8d
L001c: cmp r8d, ecx
L001f: jl short L000f
```

to just this:

```asm
L0014: add ecx, [rax]
L0016: add rax, 4
L001a: cmp rax, rdx
L001d: jb short L0014
```

We're using this pattern manually in ImageSharp (e.g. here and here) and we saw some noticeable performance improvements from applying this optimization alone to many of our inner loops in the various image processors (especially the convolution ones). Hope this helps, this whole issue looks great! 🚀

---
@BruceForstall you merged the code for #6569. Please update the above plan with the status (either check the boxes or mark the items as Done).

---
We don't expect to make significant additional improvements in loop optimizations in .NET 6. This issue will serve as a snapshot of the work that was considered and completed. I've opened a new meta-issue to track loop optimization planning going forward, for .NET 7 and beyond: #55235

---
RyuJIT has several loop optimization phases that have various issues (both correctness and performance) and can be significantly improved. RyuJIT also lacks some loop optimizations that have been shown to benefit various use cases. For .NET 6, the proposed work is to fix and improve the existing phases, and to collect information and develop a plan for adding the missing phases.
## Existing Optimizations
Below is a list of the existing loop-related RyuJIT phases and a short description of the improvement opportunities.
### Loop Recognition
RyuJIT currently has lexical-based loop recognition and only recognizes natural loops. We should consider replacing it with a standard Tarjan SCC algorithm that classifies all loops. Then we can extend some loop optimizations to also work on non-natural loops.
Even if we continue to use the current algorithm, we should verify that it catches the maximal set of natural loops; it is believed that it misses some natural loops.
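For illustration, here is a contrived C# sketch (hypothetical, not taken from the JIT test suite) of a non-natural loop: the `goto` gives the cycle a second entry point, so no single block dominates the cycle and the current natural-loop recognition cannot represent it.

```csharp
// A cycle with two entries is irreducible (non-natural): neither the
// "top" block nor the "middle" block dominates the other, so there is
// no natural-loop header. A Tarjan-style SCC analysis would still find
// and classify the cycle {top, middle}.
static int IrreducibleSum(int[] a, bool enterMiddle)
{
    int i = 0, sum = 0;
    if (enterMiddle) goto middle;   // second entry into the cycle
top:
    if (i >= a.Length) return sum;
    sum += a[i];
middle:
    i++;
    goto top;
}
```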
### Loop Inversion
"While" loops are transformed into "do-while" loops to save one branch per iteration of the loop. Some issues have been identified with the heuristics for this optimization.
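As a hand-written sketch of the transformation itself (not actual RyuJIT output), inversion turns a top-tested loop into a bottom-tested one behind a single up-front guard:

```csharp
// Before inversion: each iteration executes two branches,
// the top-of-loop test plus the unconditional back edge.
static int SumWhile(int[] a)
{
    int sum = 0, i = 0;
    while (i < a.Length)
    {
        sum += a[i];
        i++;
    }
    return sum;
}

// After inversion: the guard runs once, and the loop has a single
// conditional back-edge branch per iteration.
static int SumDoWhile(int[] a)
{
    int sum = 0, i = 0;
    if (i < a.Length)
    {
        do
        {
            sum += a[i];
            i++;
        } while (i < a.Length);
    }
    return sum;
}
```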
### Loop Cloning
This optimization creates two copies of a loop, one with bounds checks and one without, and executes one of them at runtime based on some condition. Several issues have been identified with this optimization. One recurring theme is unnecessary loop cloning, where we first clone a loop and then eliminate range checks from both copies.
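Conceptually, the transformed code has the following shape (a hand-written sketch; the actual condition the JIT synthesizes is more involved):

```csharp
// Original loop: every a[i] access carries a bounds check.
static int Sum(int[] a, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

// After cloning: a runtime test picks between a fast copy whose bounds
// checks can be removed and a slow copy that keeps them (preserving the
// original exception behavior).
static int SumCloned(int[] a, int n)
{
    int sum = 0;
    if (a != null && n <= a.Length)
    {
        for (int i = 0; i < n; i++)
            sum += a[i];    // fast path: bounds checks eliminated
    }
    else
    {
        for (int i = 0; i < n; i++)
            sum += a[i];    // slow path: bounds checks retained
    }
    return sum;
}
```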
### Loop Unrolling
The existing phase only does full unrolls, and only for SIMD loops: the current heuristic requires the loop bounds test to be a SIMD element count. The impact of the optimization is therefore very limited today, but unrolling is in general a high-impact optimization given the right heuristics.
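A sketch of the kind of loop the current heuristic targets (a hypothetical example; `Vector<float>.Count` is a JIT-time constant, so the trip count is known exactly):

```csharp
using System.Numerics;

// Because Vector<float>.Count is fixed at JIT time (e.g. 8 with AVX),
// this loop can be fully unrolled into straight-line code, conceptually:
//     sum = v[0] + v[1] + ... + v[7];
static float Sum(Vector<float> v)
{
    float sum = 0;
    for (int i = 0; i < Vector<float>.Count; i++)
        sum += v[i];
    return sum;
}
```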
### Loop Invariant Code Hoisting
This phase attempts to hoist code that will produce the same value on each iteration of the loop into the loop pre-header. There is at least one known correctness issue (and likely more), and there are multiple issues about limitations of the algorithm.
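A minimal sketch of the transformation (hand-written, to show the shape rather than RyuJIT's implementation):

```csharp
// Before hoisting: x * y produces the same value on every iteration
// but is recomputed inside the loop.
static int Scale(int[] a, int x, int y)
{
    int sum = 0;
    for (int i = 0; i < a.Length; i++)
        sum += a[i] * (x * y);
    return sum;
}

// After hoisting: the invariant product is computed once in the
// loop pre-header.
static int ScaleHoisted(int[] a, int x, int y)
{
    int sum = 0;
    int t = x * y;    // hoisted invariant computation
    for (int i = 0; i < a.Length; i++)
        sum += a[i] * t;
    return sum;
}
```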
### Loop optimization hygiene
Loop optimizations need to work well with the rest of the compiler phases and IR invariants, such as with PGO.
## Missing Optimizations
Several major optimizations are missing even though we have evidence of their effectiveness (at least on microbenchmarks).
### Induction Variable Widening
Induction variable widening eliminates unnecessary widening conversions from 32-bit induction variables to the 64-bit values used in address modes. On AMD64, this eliminates the unnecessary `movsxd` instructions otherwise emitted prior to array dereferencing.
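A source-level sketch of the idea (the JIT performs this on its IR; the "after" version is only a conceptual rendering):

```csharp
// Before widening: i is 32-bit, so forming the 64-bit address of a[i]
// on AMD64 requires a movsxd sign-extension on every iteration:
//     movsxd r9, r8d
//     mov    r9d, [rdx+r9*4]
static long Sum(int[] a, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

// After widening (conceptually): the induction variable lives in a
// 64-bit register for the whole loop, so no per-iteration sign
// extension is needed when indexing.
static long SumWidened(int[] a, int n)
{
    long sum = 0;
    for (long i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```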
### Strength Reduction
Strength reduction replaces expensive operations with equivalent but less expensive operations.
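For example, a multiplication by the induction variable can be reduced to an addition carried across iterations (a hand-written sketch of the classic case):

```csharp
// Before: the offset i * stride costs a multiply on every iteration.
static int SumStrided(int[] data, int n, int stride)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += data[i * stride];
    return sum;
}

// After strength reduction (conceptually): the multiply is replaced by
// a running offset that is bumped by stride each iteration.
static int SumStridedReduced(int[] data, int n, int stride)
{
    int sum = 0;
    int offset = 0;
    for (int i = 0; i < n; i++)
    {
        sum += data[offset];
        offset += stride;    // replaces the i * stride multiply
    }
    return sum;
}
```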
### Loop Unswitching
Loop unswitching moves a conditional from inside a loop to outside of it by duplicating the loop's body and placing a version of the loop inside each of the `if` and `else` clauses of the conditional. It has elements of both Loop Cloning and Loop Invariant Code Motion.
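A minimal sketch, assuming a loop-invariant flag:

```csharp
// Before unswitching: the invariant flag is re-tested on every
// iteration.
static void Transform(int[] a, bool negate)
{
    for (int i = 0; i < a.Length; i++)
    {
        if (negate)
            a[i] = -a[i];
        else
            a[i] = a[i] * 2;
    }
}

// After unswitching: the flag is tested once, and each arm of the
// conditional gets its own copy of the loop.
static void TransformUnswitched(int[] a, bool negate)
{
    if (negate)
    {
        for (int i = 0; i < a.Length; i++)
            a[i] = -a[i];
    }
    else
    {
        for (int i = 0; i < a.Length; i++)
            a[i] = a[i] * 2;
    }
}
```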
### Loop Interchange
Loop interchange swaps an inner and outer loop to provide follow-on optimization opportunities.
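A hand-written sketch of the classic case, traversing a row-major two-dimensional array:

```csharp
// Before interchange: the inner loop strides down a column, touching
// memory locations a whole row apart on each step.
static long SumColumnFirst(int[,] m, int rows, int cols)
{
    long sum = 0;
    for (int j = 0; j < cols; j++)
        for (int i = 0; i < rows; i++)
            sum += m[i, j];
    return sum;
}

// After interchange: the loops are swapped so the inner loop walks
// consecutive elements, improving cache locality and exposing
// follow-on opportunities such as vectorization.
static long SumRowFirst(int[,] m, int rows, int cols)
{
    long sum = 0;
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            sum += m[i, j];
    return sum;
}
```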
## Benefits
It's easy to show the benefit of improved loop optimizations on microbenchmarks. For example, the team did an analysis of JIT microbenchmarks (benchstones, SciMark, etc.) several years ago; that analysis estimated the perf improvement from several of these optimizations at low single-digit percentages each. Real code is also likely to have hot loops that will benefit from improved loop optimizations.
The benchmarks and other metrics we will measure to show the benefits are TBD.
## Proposed work
category:planning
theme:loop-opt
skill-level:expert
cost:large