Possible JIT improvement for loops with return statements #7474

bbowyersmyth · 2017-02-21T07:01:33Z

Loops with a return statement in the body run slower than those without. It would be good if the JIT had a way to optimize this. If tracking this is too complex, perhaps moving `return [true|false]` statements out of the loop could be a starting point.

Test code
https://gist.github.com/bbowyersmyth/9514af463745528d8d290e7cd2492660

The very simple loop runs 85.7ns vs 67.4ns (roughly a 20% difference). The gap can widen as more instructions are added to the body.
The current theory is that this is due to the CPU's complexity rules for the loop stream detector.

Initial suggestion by @jkotas dotnet/coreclr#2667 (comment)
Recent discussion dotnet/coreclr#9213

cc @mikedn


benaadams · 2017-02-21T07:36:20Z

The inline return will add a bunch of additional pop instructions that need to be jumped over, making the body of the loop larger.

bbowyersmyth · 2017-02-21T09:43:10Z

Apart from a branch misprediction, are there other factors at play that make skipping over multiple instructions more costly than skipping over one for a given iteration?

mikedn · 2017-02-21T11:28:35Z

That branch should be correctly predicted, so that's not an issue. One possible explanation for the behavior you are seeing is that predicted and taken branches are slightly more expensive than predicted and not-taken branches:

G_M11558_IG05:
       0FB708               movzx    rcx, word  ptr [rax]
       440FB70A             movzx    r9, word  ptr [rdx]
       413BC9               cmp      ecx, r9d
; predicted and taken every loop iteration
       7407                 je       SHORT G_M11558_IG07
       33C0                 xor      eax, eax
G_M11558_IG06:
       4883C418             add      rsp, 24
       C3                   ret
G_M11558_IG07:
       4883C002             add      rax, 2
       4883C202             add      rdx, 2
       41FFC8               dec      r8d
       4585C0               test     r8d, r8d
       75DD                 jne      SHORT G_M11558_IG05

; versus
G_M50079_IG05:
       440FB701             movzx    r8, word  ptr [rcx]
       440FB70A             movzx    r9, word  ptr [rdx]
       453BC1               cmp      r8d, r9d
; predicted and NOT taken every loop iteration
       7518                 jne      SHORT G_M50079_IG08
       4883C102             add      rcx, 2
       4883C202             add      rdx, 2
       FFC8                 dec      eax
       85C0                 test     eax, eax
       75E5                 jne      SHORT G_M50079_IG05

Anyway, it's best to pull non-loop code out of loops for various reasons including branch cost and LSD.
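To make the two source shapes concrete, here is a minimal sketch in C rather than the C# from the gist. The function names and the `not_equal` label are hypothetical, and a C compiler is free to lay both versions out identically; the point is only to show where the exit code sits relative to the loop, which is what the `goto` workarounds for this issue do by hand.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Shape 1: the early return is written lexically inside the loop body. */
bool equals_inline_return(const uint16_t *a, const uint16_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (a[i] != b[i]) {
            return false;   /* exit code sits inside the loop */
        }
    }
    return true;
}

/* Shape 2: the loop body only branches out; the exit code lives after the loop. */
bool equals_exit_after_loop(const uint16_t *a, const uint16_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (a[i] != b[i]) {
            goto not_equal; /* forward branch, not taken on the hot path */
        }
    }
    return true;

not_equal:
    return false;
}
```

Both functions return the same results; only the placement of the non-loop code differs.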

Span performance improvements

- Add System.Memory dependency on System.Vectors for !netstandard10 build configurations
- Vectorized SequenceEquals for Span<byte> and Span<char>
- Add workarounds for https://github.com/dotnet/coreclr/issues/9692

Low-tech approach to #9692. Finds forward branches to returns and moves the return block later in the method. Gives better layout in simple cases of search loops that return when they find a result. Won't handle cases where there are multiple non-loop blocks in a loop reachable from just one loop block, or cases where there are non-loop blocks but not returns, say from a `break` or similar construct.

AndyAyersMS · 2017-04-25T19:40:57Z

Some perf results from the low-tech dotnet/coreclr#11192:

Fasta (~6%): moved return block out of loop in SelectRandom
TreeSort (~5%): compacted loop body in Insert somewhat

There is also what looks like a win in some of the Linq Where tests. Hard to be 100% sure based on diffs since dotnet/coreclr#11192 is moving blocks that I don't expect it to move -- likely a limitation of using BBF_BACKWARDS_JUMP as a crude in-loop detector.

The tree sort case is a good example of why something more general is warranted, as there are three loop exits, two series of non-loop blocks and two backedges. There's likely an even bigger win if we can generally move all the non-loop code out. And we'd want something that works for breaks from inner loops in nests and not just for returns.

I'm going to put this up for consideration in 2.1 since it looks like we really should fix this.

bbowyersmyth · 2017-04-25T21:18:36Z

No idea how advanced LSDs are, but are they still able to optimize if the body is not in the loop?

redknightlois · 2017-06-09T00:26:03Z

This is a common and measurable optimization on our tightest code. Definitely a win if we can avoid writing so many `goto` statements.

Rearrange basic blocks during loop identification so that loop bodies are kept contiguous when possible. Blocks in the lexical range of the loop which do not participate in the flow cycle (which typically correspond to code associated with early exits using `break` or `return`) are moved out below the loop when possible without breaking EH region nesting. The target insertion point, when possible, is chosen to be the first spot below the loop that will not break up fall-through. Layout can significantly affect the performance of loops, particularly small search loops, by avoiding the taken branch on the hot path, improving the locality of the code fetched while iterating the loop, and potentially aiding loop stream detection. Resolves #9692.
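As a rough illustration of the idea described above (not the JIT's actual implementation), the following C sketch treats the method layout as an array of blocks and stably moves the blocks that are not part of the loop's flow cycle to the end of the loop's lexical range. The `Block` type, its fields, and `compact_loop` are hypothetical names. The sketch deliberately ignores EH region nesting and the choice of an insertion point that preserves fall-through, and it assumes every block ends in an explicit jump so that reordering alone is safe.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a basic block. */
typedef struct {
    int  id;        /* block number, for illustration only */
    bool in_cycle;  /* true if the block participates in the loop's flow cycle */
} Block;

/*
 * Stable in-place partition of layout[top..bottom] (inclusive):
 * blocks in the flow cycle keep their relative order and become contiguous,
 * all other blocks keep their relative order and end up below them.
 */
void compact_loop(Block *layout, size_t top, size_t bottom)
{
    size_t write = top;  /* next slot for a loop-cycle block */

    for (size_t read = top; read <= bottom; read++) {
        if (!layout[read].in_cycle) {
            continue;    /* leave non-loop blocks to be shifted down */
        }
        Block b = layout[read];
        /* Shift the non-loop blocks in [write, read) down by one slot. */
        for (size_t k = read; k > write; k--) {
            layout[k] = layout[k - 1];
        }
        layout[write++] = b;
    }
}
```

For a layout of blocks 5 (loop), 6 (return, non-loop), 7 (loop), calling `compact_loop(layout, 0, 2)` yields the order 5, 7, 6, so the loop body becomes contiguous and the return block sits below it.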

danmoseley · 2017-08-20T05:51:11Z

this can be removed
C:\git\coreclr\src\mscorlib\src\System\String.Comparison.cs:
47: goto ReturnCharAMinusCharB; // TODO: Workaround for https://github.com/dotnet/coreclr/issues/9692

JosephTremoulet · 2017-08-20T12:26:56Z

this can be removed
C:\git\coreclr\src\mscorlib\src\System\String.Comparison.cs:
47: goto ReturnCharAMinusCharB; // TODO: Workaround for dotnet/coreclr#9692

See https://github.com/dotnet/coreclr/issues/13466

Remove some `goto`s that were added to work around #9692 (poor code layout for loop exit paths) -- the JIT's layout decisions were improved in dotnet#13314, and these particular `goto`s are no longer needed; crossgen of System.Private.CoreLib now produces the same machine code with or without this change. Part of #13466.


Remove some `goto`s that were added to work around dotnet/coreclr#9692 (poor code layout for loop exit paths) -- the JIT's layout decisions were improved in dotnet/coreclr#13314, and these particular `goto`s are no longer needed; the same machine code is generated with or without this change. Some `goto`s previously tagged as workarounds for dotnet/coreclr#9692 are still relevant for keeping codesize down pending dotnet/coreclr#13549; update their comments accordingly. Part of #23395.

Remove some `goto`s that were added to work around undesirable jit layout (#9692, fixed in dotnet#13314) and epilog factoring (improved in dotnet#13792 and dotnet#13903), which are no longer needed. Resolves #13466.


JosephTremoulet closed this as completed in dotnet/coreclr#13314 Aug 18, 2017

EamonNerbonne referenced this issue in EamonNerbonne/anoprsst May 10, 2018

Workaround https://github.com/dotnet/coreclr/issues/9692

ad42ea5

msftgits transferred this issue from dotnet/coreclr Jan 31, 2020

msftgits added this to the 2.1.0 milestone Jan 31, 2020

AndyAyersMS mentioned this issue Jan 31, 2020

JIT: Missed bounds check vs Jit32 #8559

Open

ghost locked as resolved and limited conversation to collaborators Dec 25, 2020
