# Improve JIT loop optimizations (.NET 6) #43549
fyi @dotnet/jit-contrib

---
Some candidates for the "Missing optimizations" section:

```csharp
static void Foo3(byte* array, int len, byte val)
{
    for (int i = 0; i < len; i++)
        array[i] = val;
}
```

is folded into:

```llvm
call void @llvm.memset.p0i8.i64(i8* %array, i8 %val, i64 %len, i32 1, i1 false)
```
Also, really enjoyed this one:

```c
bool is_sorted(int *a, int n) {
    for (int i = 0; i < n - 1; i++)
        if (a[i] > a[i + 1])
            return false;
    return true;
}
```

---
An induction analysis case from this thread https://news.ycombinator.com/item?id=13182726:

```c
int X(int num) {
    int a = 0;
    for (int x = 0; x < num; x += 2) {
        if (x % 2 != 0) {
            a += x;
        }
    }
    return a;
}
```

where gcc and clang with -O2 produce:

```asm
X(int):    # @X(int)
    xor eax, eax
    ret
```

The key analysis going on here is called scalar evolution (SCEV) in the LLVM community. It is basically just a stronger symbolic induction variable analysis than what is normally needed for traditional loop optimizations. In this case SCEV identifies that `x` starts at 0, increments to `num` by 2, and is always even, so the `x % 2 != 0` branch is never taken and `a` remains 0. It also knows that `a` is a reduction of `x`. The calculus to simplify this is actually straightforward based on that symbolic information.

---
Reminds me of an issue we had in

---
Just a thought - it would also be nice to add to the strength reduction category the ability for the JIT to move the base indexing calculation outside of the loop body in cases where the range is just over interior references on the same object, e.g. when doing a `foreach` over a `Span<T>`. For reference (just a random example purely to show the loop codegen difference):

Foreach loop:

```csharp
static int Sum1(Span<int> span)
{
    int sum = 0;
    foreach (int n in span)
    {
        sum += n;
    }
    return sum;
}
```

x64 codegen:

```asm
L0000: xor eax, eax
L0002: mov rdx, [rcx]
L0005: mov ecx, [rcx+8]
L0008: xor r8d, r8d
L000b: test ecx, ecx
L000d: jle short L0021
L000f: movsxd r9, r8d
L0012: mov r9d, [rdx+r9*4]
L0016: add eax, r9d
L0019: inc r8d
L001c: cmp r8d, ecx
L001f: jl short L000f
L0021: ret
```

Manually optimized loop:

```csharp
static int Sum2(Span<int> span)
{
    ref int rStart = ref MemoryMarshal.GetReference(span);
    ref int rEnd = ref Unsafe.Add(ref rStart, span.Length);
    int sum = 0;
    while (Unsafe.IsAddressLessThan(ref rStart, ref rEnd))
    {
        sum += rStart;
        rStart = ref Unsafe.Add(ref rStart, 1);
    }
    return sum;
}
```

x64 codegen:

```asm
L0000: mov rax, [rcx]
L0003: mov edx, [rcx+8]
L0006: movsxd rdx, edx
L0009: lea rdx, [rax+rdx*4]
L000d: xor ecx, ecx
L000f: cmp rax, rdx
L0012: jae short L001f
L0014: add ecx, [rax]
L0016: add rax, 4
L001a: cmp rax, rdx
L001d: jb short L0014
L001f: mov eax, ecx
L0021: ret
```

The loop body in this example goes down from:

```asm
L000f: movsxd r9, r8d
L0012: mov r9d, [rdx+r9*4]
L0016: add eax, r9d
L0019: inc r8d
L001c: cmp r8d, ecx
L001f: jl short L000f
```

to just this:

```asm
L0014: add ecx, [rax]
L0016: add rax, 4
L001a: cmp rax, rdx
L001d: jb short L0014
```

We're using this pattern manually in ImageSharp (e.g. here and here) and we saw some noticeable performance improvements from applying this optimization alone to many of our inner loops in the various image processors (especially the convolution ones). Hope this helps, this whole issue looks great! 🚀

---
@BruceForstall you merged the code for #6569. Please update the above plan with the status (either check the boxes or mark the items as Done).

---
We don't expect to make significant additional improvements in loop optimizations in .NET 6. This issue will serve as a snapshot of the work that was considered and completed. I've opened a new meta-issue to track loop optimization planning going forward, for .NET 7 and beyond: #55235

---
RyuJIT has several loop optimization phases that have various issues (both correctness and performance) and can be significantly improved. RyuJIT also lacks some loop optimizations that have been shown to benefit various use cases. For .NET 6, the proposed work is to fix and improve the existing phases, and to collect information and develop a plan for adding the missing phases.
## Existing Optimizations
Below is a list of the existing loop-related RyuJIT phases and a short description of the improvement opportunities.
### Loop Recognition
RyuJIT currently has lexical-based loop recognition and only recognizes natural loops. We should consider replacing it with a standard Tarjan SCC algorithm that classifies all loops. Then we can extend some loop optimizations to also work on non-natural loops.
Even if we continue to use the current algorithm, we should verify that it catches the maximal set of natural loops; it is believed that it misses some natural loops.
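For illustration, here is a contrived C# sketch (hypothetical, not taken from the JIT test suite) of a non-natural loop: the `goto` gives the cycle a second entry point, so no single block dominates the cycle and the current natural-loop recognition cannot represent it.

```csharp
// A cycle with two entries is irreducible (non-natural): neither the
// "top" block nor the "middle" block dominates the other, so there is
// no natural-loop header. A Tarjan-style SCC analysis would still find
// and classify the cycle {top, middle}.
static int IrreducibleSum(int[] a, bool enterMiddle)
{
    int i = 0, sum = 0;
    if (enterMiddle) goto middle;   // second entry into the cycle
top:
    if (i >= a.Length) return sum;
    sum += a[i];
middle:
    i++;
    goto top;
}
```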
### Loop Inversion
"While" loops are transformed into "do-while" loops to save one branch per iteration of the loop. Some issues have been identified with the heuristics for this optimization.
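As a hand-written sketch of the transformation itself (not actual RyuJIT output), inversion turns a top-tested loop into a bottom-tested one behind a single up-front guard:

```csharp
// Before inversion: each iteration executes two branches,
// the top-of-loop test plus the unconditional back edge.
static int SumWhile(int[] a)
{
    int sum = 0, i = 0;
    while (i < a.Length)
    {
        sum += a[i];
        i++;
    }
    return sum;
}

// After inversion: the guard runs once, and the loop has a single
// conditional back-edge branch per iteration.
static int SumDoWhile(int[] a)
{
    int sum = 0, i = 0;
    if (i < a.Length)
    {
        do
        {
            sum += a[i];
            i++;
        } while (i < a.Length);
    }
    return sum;
}
```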
### Loop Cloning
This optimization creates two copies of a loop, one with bounds checks and one without, and executes one of them at runtime based on some condition. Several issues have been identified with this optimization. One recurring theme is unnecessary loop cloning, where we first clone a loop and then eliminate range checks from both copies.
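Conceptually, the transformed code has the following shape (a hand-written sketch; the actual condition the JIT synthesizes is more involved):

```csharp
// Original loop: every a[i] access carries a bounds check.
static int Sum(int[] a, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

// After cloning: a runtime test picks between a fast copy whose bounds
// checks can be removed and a slow copy that keeps them (preserving the
// original exception behavior).
static int SumCloned(int[] a, int n)
{
    int sum = 0;
    if (a != null && n <= a.Length)
    {
        for (int i = 0; i < n; i++)
            sum += a[i];    // fast path: bounds checks eliminated
    }
    else
    {
        for (int i = 0; i < n; i++)
            sum += a[i];    // slow path: bounds checks retained
    }
    return sum;
}
```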
### Loop Unrolling
The existing phase only does full unrolls, and only for SIMD loops: the current heuristic requires the loop bounds test to be a SIMD element count. The impact of the optimization is therefore very limited today, but unrolling is in general a high-impact optimization given the right heuristics.
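A sketch of the kind of loop the current heuristic targets (a hypothetical example; `Vector<float>.Count` is a JIT-time constant, so the trip count is known exactly):

```csharp
using System.Numerics;

// Because Vector<float>.Count is fixed at JIT time (e.g. 8 with AVX),
// this loop can be fully unrolled into straight-line code, conceptually:
//     sum = v[0] + v[1] + ... + v[7];
static float Sum(Vector<float> v)
{
    float sum = 0;
    for (int i = 0; i < Vector<float>.Count; i++)
        sum += v[i];
    return sum;
}
```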
### Loop Invariant Code Hoisting
This phase attempts to hoist code that will produce the same value on each iteration of the loop into the loop pre-header. There is at least one known correctness issue (and likely more), and there are multiple issues about limitations of the algorithm.
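A minimal sketch of the transformation (hand-written, to show the shape rather than RyuJIT's implementation):

```csharp
// Before hoisting: x * y produces the same value on every iteration
// but is recomputed inside the loop.
static int Scale(int[] a, int x, int y)
{
    int sum = 0;
    for (int i = 0; i < a.Length; i++)
        sum += a[i] * (x * y);
    return sum;
}

// After hoisting: the invariant product is computed once in the
// loop pre-header.
static int ScaleHoisted(int[] a, int x, int y)
{
    int sum = 0;
    int t = x * y;    // hoisted invariant computation
    for (int i = 0; i < a.Length; i++)
        sum += a[i] * t;
    return sum;
}
```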
### Loop optimization hygiene
Loop optimizations need to work well with the rest of the compiler phases and IR invariants, such as with PGO.
## Missing Optimizations
Several major optimizations are missing even though we have evidence of their effectiveness (at least on microbenchmarks).
### Induction Variable Widening
Induction variable widening eliminates unnecessary widening conversions from 32-bit induction variables to the 64-bit values used in address modes. On AMD64, this eliminates the unnecessary `movsxd` instructions otherwise emitted prior to array dereferencing.
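A source-level sketch of the idea (the JIT performs this on its IR; the "after" version is only a conceptual rendering):

```csharp
// Before widening: i is 32-bit, so forming the 64-bit address of a[i]
// on AMD64 requires a movsxd sign-extension on every iteration:
//     movsxd r9, r8d
//     mov    r9d, [rdx+r9*4]
static long Sum(int[] a, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

// After widening (conceptually): the induction variable lives in a
// 64-bit register for the whole loop, so no per-iteration sign
// extension is needed when indexing.
static long SumWidened(int[] a, int n)
{
    long sum = 0;
    for (long i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```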
### Strength Reduction
Strength reduction replaces expensive operations with equivalent but less expensive operations.
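For example, a multiplication by the induction variable can be reduced to an addition carried across iterations (a hand-written sketch of the classic case):

```csharp
// Before: the offset i * stride costs a multiply on every iteration.
static int SumStrided(int[] data, int n, int stride)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += data[i * stride];
    return sum;
}

// After strength reduction (conceptually): the multiply is replaced by
// a running offset that is bumped by stride each iteration.
static int SumStridedReduced(int[] data, int n, int stride)
{
    int sum = 0;
    int offset = 0;
    for (int i = 0; i < n; i++)
    {
        sum += data[offset];
        offset += stride;    // replaces the i * stride multiply
    }
    return sum;
}
```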
### Loop Unswitching
Loop unswitching moves a conditional from inside a loop to outside of it by duplicating the loop's body and placing a version of the loop inside each of the `if` and `else` clauses of the conditional. It has elements of both Loop Cloning and Loop Invariant Code Motion.
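A minimal sketch, assuming a loop-invariant flag:

```csharp
// Before unswitching: the invariant flag is re-tested on every
// iteration.
static void Transform(int[] a, bool negate)
{
    for (int i = 0; i < a.Length; i++)
    {
        if (negate)
            a[i] = -a[i];
        else
            a[i] = a[i] * 2;
    }
}

// After unswitching: the flag is tested once, and each arm of the
// conditional gets its own copy of the loop.
static void TransformUnswitched(int[] a, bool negate)
{
    if (negate)
    {
        for (int i = 0; i < a.Length; i++)
            a[i] = -a[i];
    }
    else
    {
        for (int i = 0; i < a.Length; i++)
            a[i] = a[i] * 2;
    }
}
```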
### Loop Interchange
Loop interchange swaps an inner and outer loop to provide follow-on optimization opportunities.
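A hand-written sketch of the classic case, traversing a row-major two-dimensional array:

```csharp
// Before interchange: the inner loop strides down a column, touching
// memory locations a whole row apart on each step.
static long SumColumnFirst(int[,] m, int rows, int cols)
{
    long sum = 0;
    for (int j = 0; j < cols; j++)
        for (int i = 0; i < rows; i++)
            sum += m[i, j];
    return sum;
}

// After interchange: the loops are swapped so the inner loop walks
// consecutive elements, improving cache locality and exposing
// follow-on opportunities such as vectorization.
static long SumRowFirst(int[,] m, int rows, int cols)
{
    long sum = 0;
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            sum += m[i, j];
    return sum;
}
```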
## Benefits
It's easy to show the benefit of improved loop optimizations on microbenchmarks. For example, the team did an analysis of JIT microbenchmarks (benchstones, SciMark, etc.) several years ago; that analysis estimated the perf improvement from several of these optimizations at low single-digit percentages each. Real code is also likely to have hot loops that will benefit from improved loop optimizations.
The benchmarks and other metrics we will measure to show the benefits are TBD.
## Proposed work
category:planning
theme:loop-opt
skill-level:expert
cost:large