Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible JIT improvement for loops with return statements #7474

Closed
bbowyersmyth opened this issue Feb 21, 2017 · 8 comments
Closed

Possible JIT improvement for loops with return statements #7474

bbowyersmyth opened this issue Feb 21, 2017 · 8 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions optimization tenet-performance Performance related issue
Milestone

Comments

@bbowyersmyth
Copy link
Contributor

Loops with a return statement in the body run slower than those without. It would be good if the JIT had a way to optimize this. If tracking this was too complex perhaps moving return [true|false] statements could be a starting point.

Test code
https://gist.github.com/bbowyersmyth/9514af463745528d8d290e7cd2492660

The very simple loop runs 85.7ns vs 67.4ns (20% difference). The gap can widen with additional instructions added to the body.
Current theory is that this is due to the CPUs complexity rules for the loop stream detector.

Initial suggestion by @jkotas dotnet/coreclr#2667 (comment)
Recent discussion dotnet/coreclr#9213

cc @mikedn

@benaadams
Copy link
Member

The inline return will add a bunch of additional pop statements that need to be jumped over making the body of the loop larger

@bbowyersmyth
Copy link
Contributor Author

Apart from a branch misprediction are there other factors at play that make skipping over multiple instructions more costly than one for a given iteration?

@mikedn
Copy link
Contributor

mikedn commented Feb 21, 2017

That branch should be correctly predicted so that's not an issue. One possible explanation for the behavior you are seeing is that predicted and taken branches and slightly more expensive than predicated and not taken branches:

G_M11558_IG05:
       0FB708               movzx    rcx, word  ptr [rax]
       440FB70A             movzx    r9, word  ptr [rdx]
       413BC9               cmp      ecx, r9d
; predicted and taken every loop iteration
       7407                 je       SHORT G_M11558_IG07
       33C0                 xor      eax, eax
G_M11558_IG06:
       4883C418             add      rsp, 24
       C3                   ret
G_M11558_IG07:
       4883C002             add      rax, 2
       4883C202             add      rdx, 2
       41FFC8               dec      r8d
       4585C0               test     r8d, r8d
       75DD                 jne      SHORT G_M11558_IG05

; versus
G_M50079_IG05:
       440FB701             movzx    r8, word  ptr [rcx]
       440FB70A             movzx    r9, word  ptr [rdx]
       453BC1               cmp      r8d, r9d
; predicted and NOT taken every loop iteration
       7518                 jne      SHORT G_M50079_IG08
       4883C102             add      rcx, 2
       4883C202             add      rdx, 2
       FFC8                 dec      eax
       85C0                 test     eax, eax
       75E5                 jne      SHORT G_M50079_IG05

Anyway, it's best to pull non-loop code out of loops for various reasons including branch cost and LSD.

jkotas referenced this issue in jkotas/corefx Mar 12, 2017
- Add System.Memory dependency on System.Vectors for !netstandard10 build configurations
- Vectorized SequenceEquals for Span<byte> and Span<char>
- Add workarounds for https://github.com/dotnet/coreclr/issues/9692
jkotas referenced this issue in dotnet/corefx Mar 13, 2017
* Span performance improvements
- Add System.Memory dependency on System.Vectors for !netstandard10 build configurations
- Vectorized SequenceEquals for Span<byte> and Span<char>
- Add workarounds for https://github.com/dotnet/coreclr/issues/9692
AndyAyersMS referenced this issue in AndyAyersMS/coreclr Apr 25, 2017
Low-tech approach to #9692. Finds forward branches to returns
and moves the return block later in the method.

Gives better layout in simple cases of search loops that return
when they find a result. Won't handle cases where there are multiple
non-loop blocks in a loop reachable from just one loop block, or
cases where there are non-loop blocks but not returns, say from
a `break` or similar construct.
@AndyAyersMS
Copy link
Member

Some perf results from the low-tech dotnet/coreclr#11192:

  • Fasta (~6%): moved return block out of loop in SelectRandom
  • TreeSort (~5%): compacted loop body in Insert somewhat

There is also what looks like a win in some of the Linq Where tests. Hard to be 100% sure based on diffs since dotnet/coreclr#11192 is moving blocks that I don't expect it to move -- likely a limitation of using BBF_BACKWARDS_JUMP as a crude in-loop detector.

The tree sort case is a good example of why something more general is warranted, as there are three loop exits, two series of non-loop blocks and two backedges. There's likely an even bigger win if we can generally move all the non-loop code out. And we'd want something that works for breaks from inner loops in nests and not just for returns.

I'm going to put this up for consideration in 2.1 since it looks like we really should fix this.

@bbowyersmyth
Copy link
Contributor Author

No idea how advanced LSDs are but is it still able to optimize if the body is not in the loop?

@redknightlois
Copy link

This is a common and measurable optimization on our tightest code. Definitely a win if we can avoid writing so many go-to statements.

JosephTremoulet referenced this issue in JosephTremoulet/coreclr Aug 7, 2017
Rearrange basic blocks during loop identification so that loop bodies
are kept contiguous when possible.  Blocks in the lexical range of the
loop which do not participate in the flow cycle (which typically
correspond to code associated with early exits using `break` or
`return`) are moved out below the loop when possible without breaking EH
region nesting.  The target insertion point, when possible, is chosen to
be the first spot below the loop that will not break up fall-through.

Layout can significantly affect the performance of loops, particularly
small search loops, by avoiding the taken branch on the hot path,
improving the locality of the code fetched while iterating the loop, and
potentially aiding loop stream detection.

Resolves #9692.
JosephTremoulet referenced this issue in JosephTremoulet/coreclr Aug 8, 2017
Rearrange basic blocks during loop identification so that loop bodies
are kept contiguous when possible.  Blocks in the lexical range of the
loop which do not participate in the flow cycle (which typically
correspond to code associated with early exits using `break` or
`return`) are moved out below the loop when possible without breaking EH
region nesting.  The target insertion point, when possible, is chosen to
be the first spot below the loop that will not break up fall-through.

Layout can significantly affect the performance of loops, particularly
small search loops, by avoiding the taken branch on the hot path,
improving the locality of the code fetched while iterating the loop, and
potentially aiding loop stream detection.

Resolves #9692.
JosephTremoulet referenced this issue in JosephTremoulet/coreclr Aug 9, 2017
Rearrange basic blocks during loop identification so that loop bodies
are kept contiguous when possible.  Blocks in the lexical range of the
loop which do not participate in the flow cycle (which typically
correspond to code associated with early exits using `break` or
`return`) are moved out below the loop when possible without breaking EH
region nesting.  The target insertion point, when possible, is chosen to
be the first spot below the loop that will not break up fall-through.

Layout can significantly affect the performance of loops, particularly
small search loops, by avoiding the taken branch on the hot path,
improving the locality of the code fetched while iterating the loop, and
potentially aiding loop stream detection.

Resolves #9692.
JosephTremoulet referenced this issue in JosephTremoulet/coreclr Aug 9, 2017
Rearrange basic blocks during loop identification so that loop bodies
are kept contiguous when possible.  Blocks in the lexical range of the
loop which do not participate in the flow cycle (which typically
correspond to code associated with early exits using `break` or
`return`) are moved out below the loop when possible without breaking EH
region nesting.  The target insertion point, when possible, is chosen to
be the first spot below the loop that will not break up fall-through.

Layout can significantly affect the performance of loops, particularly
small search loops, by avoiding the taken branch on the hot path,
improving the locality of the code fetched while iterating the loop, and
potentially aiding loop stream detection.

Resolves #9692.
JosephTremoulet referenced this issue in JosephTremoulet/coreclr Aug 9, 2017
Rearrange basic blocks during loop identification so that loop bodies
are kept contiguous when possible.  Blocks in the lexical range of the
loop which do not participate in the flow cycle (which typically
correspond to code associated with early exits using `break` or
`return`) are moved out below the loop when possible without breaking EH
region nesting.  The target insertion point, when possible, is chosen to
be the first spot below the loop that will not break up fall-through.

Layout can significantly affect the performance of loops, particularly
small search loops, by avoiding the taken branch on the hot path,
improving the locality of the code fetched while iterating the loop, and
potentially aiding loop stream detection.

Resolves #9692.
JosephTremoulet referenced this issue in JosephTremoulet/coreclr Aug 10, 2017
Rearrange basic blocks during loop identification so that loop bodies
are kept contiguous when possible.  Blocks in the lexical range of the
loop which do not participate in the flow cycle (which typically
correspond to code associated with early exits using `break` or
`return`) are moved out below the loop when possible without breaking EH
region nesting.  The target insertion point, when possible, is chosen to
be the first spot below the loop that will not break up fall-through.

Layout can significantly affect the performance of loops, particularly
small search loops, by avoiding the taken branch on the hot path,
improving the locality of the code fetched while iterating the loop, and
potentially aiding loop stream detection.

Resolves #9692.
JosephTremoulet referenced this issue in JosephTremoulet/coreclr Aug 15, 2017
Rearrange basic blocks during loop identification so that loop bodies
are kept contiguous when possible.  Blocks in the lexical range of the
loop which do not participate in the flow cycle (which typically
correspond to code associated with early exits using `break` or
`return`) are moved out below the loop when possible without breaking EH
region nesting.  The target insertion point, when possible, is chosen to
be the first spot below the loop that will not break up fall-through.

Layout can significantly affect the performance of loops, particularly
small search loops, by avoiding the taken branch on the hot path,
improving the locality of the code fetched while iterating the loop, and
potentially aiding loop stream detection.

Resolves #9692.
JosephTremoulet referenced this issue in JosephTremoulet/coreclr Aug 17, 2017
Rearrange basic blocks during loop identification so that loop bodies
are kept contiguous when possible.  Blocks in the lexical range of the
loop which do not participate in the flow cycle (which typically
correspond to code associated with early exits using `break` or
`return`) are moved out below the loop when possible without breaking EH
region nesting.  The target insertion point, when possible, is chosen to
be the first spot below the loop that will not break up fall-through.

Layout can significantly affect the performance of loops, particularly
small search loops, by avoiding the taken branch on the hot path,
improving the locality of the code fetched while iterating the loop, and
potentially aiding loop stream detection.

Resolves #9692.
@danmoseley
Copy link
Member

this can be removed
C:\git\coreclr\src\mscorlib\src\System\String.Comparison.cs:
47: goto ReturnCharAMinusCharB; // TODO: Workaround for https://github.com/dotnet/coreclr/issues/9692

@JosephTremoulet
Copy link
Contributor

this can be removed
C:\git\coreclr\src\mscorlib\src\System\String.Comparison.cs:
47: goto ReturnCharAMinusCharB; // TODO: Workaround for dotnet/coreclr#9692

See https://github.com/dotnet/coreclr/issues/13466

JosephTremoulet referenced this issue in JosephTremoulet/coreclr Aug 21, 2017
Remove some `goto`s that were added to work around #9692 (poor code
layout for loop exit paths) -- the JIT's layout decisions were improved
in dotnet#13314, and these particular `goto`s are no longer needed; crossgen
of System.Private.CoreLib now produces the same machine code with or
without this change.

Part of #13466.
danmoseley referenced this issue in dotnet/coreclr Aug 22, 2017
Remove some `goto`s that were added to work around #9692 (poor code
layout for loop exit paths) -- the JIT's layout decisions were improved
in #13314, and these particular `goto`s are no longer needed; crossgen
of System.Private.CoreLib now produces the same machine code with or
without this change.

Part of #13466.
JosephTremoulet referenced this issue in JosephTremoulet/corefx Aug 23, 2017
Remove some `goto`s that were added to work around dotnet/coreclr#9692
(poor code layout for loop exit paths) -- the JIT's layout decisions
were improved in dotnet/coreclr#13314, and these particular `goto`s are
no longer needed; the same machine code is generated with or without
this change.
Some `goto`s previously tagged as workarounds for dotnet/coreclr#9692 are
still relevant for keeping codesize down pending dotnet/coreclr#13549;
update their comments accordingly.

Part of #23395.
JosephTremoulet referenced this issue in JosephTremoulet/corefx Aug 23, 2017
Remove some `goto`s that were added to work around dotnet/coreclr#9692
(poor code layout for loop exit paths) -- the JIT's layout decisions
were improved in dotnet/coreclr#13314, and these particular `goto`s are
no longer needed; the same machine code is generated with or without
this change.
Some `goto`s previously tagged as workarounds for dotnet/coreclr#9692 are
still relevant for keeping codesize down pending dotnet/coreclr#13549;
update their comments accordingly.

Part of #23395.
JosephTremoulet referenced this issue in JosephTremoulet/corefx Aug 23, 2017
Remove some `goto`s that were added to work around dotnet/coreclr#9692
(poor code layout for loop exit paths) -- the JIT's layout decisions
were improved in dotnet/coreclr#13314, and these particular `goto`s are
no longer needed; the same machine code is generated with or without
this change.
Some `goto`s previously tagged as workarounds for dotnet/coreclr#9692 are
still relevant for keeping codesize down pending dotnet/coreclr#13549;
update their comments accordingly.

Part of #23395.
jkotas referenced this issue in dotnet/corefx Aug 24, 2017
Remove some `goto`s that were added to work around dotnet/coreclr#9692
(poor code layout for loop exit paths) -- the JIT's layout decisions
were improved in dotnet/coreclr#13314, and these particular `goto`s are
no longer needed; the same machine code is generated with or without
this change.
Some `goto`s previously tagged as workarounds for dotnet/coreclr#9692 are
still relevant for keeping codesize down pending dotnet/coreclr#13549;
update their comments accordingly.

Part of #23395.
JosephTremoulet referenced this issue in JosephTremoulet/coreclr Sep 13, 2017
Remove some `goto`s that were added  to work around undesirable jit
layout (#9692, fixed in dotnet#13314) and epilog factoring (improved in
 dotnet#13792 and dotnet#13903), which are no longer needed.

Resolves #13466.
jkotas referenced this issue in dotnet/coreclr Sep 14, 2017
Remove some `goto`s that were added  to work around undesirable jit
layout (#9692, fixed in #13314) and epilog factoring (improved in
 #13792 and #13903), which are no longer needed.

Resolves #13466.
@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the 2.1.0 milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 25, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions optimization tenet-performance Performance related issue
Projects
None yet
Development

No branches or pull requests

8 participants