Possible JIT improvement for loops with return statements #7474
The inline return will add a bunch of additional |
Apart from a branch misprediction, are there other factors at play that make skipping over multiple instructions more costly than skipping one for a given iteration?
That branch should be correctly predicted, so that's not an issue. One possible explanation for the behavior you are seeing is that predicted-and-taken branches are slightly more expensive than predicted-and-not-taken branches:
G_M11558_IG05:
0FB708 movzx rcx, word ptr [rax]
440FB70A movzx r9, word ptr [rdx]
413BC9 cmp ecx, r9d
; predicted and taken every loop iteration
7407 je SHORT G_M11558_IG07
33C0 xor eax, eax
G_M11558_IG06:
4883C418 add rsp, 24
C3 ret
G_M11558_IG07:
4883C002 add rax, 2
4883C202 add rdx, 2
41FFC8 dec r8d
4585C0 test r8d, r8d
75DD jne SHORT G_M11558_IG05
; versus
G_M50079_IG05:
440FB701 movzx r8, word ptr [rcx]
440FB70A movzx r9, word ptr [rdx]
453BC1 cmp r8d, r9d
; predicted and NOT taken every loop iteration
7518 jne SHORT G_M50079_IG08
4883C102 add rcx, 2
4883C202 add rdx, 2
FFC8 dec eax
85C0 test eax, eax
75E5 jne SHORT G_M50079_IG05
Anyway, it's best to pull non-loop code out of loops for various reasons, including branch cost and the loop stream detector (LSD).
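The two listings above correspond to two ways of writing the same search loop at the source level. The sketch below shows both shapes in C rather than in the C# of the linked test code. The function names are hypothetical, and a C compiler makes its own layout decisions, so this only illustrates the source pattern and is not expected to reproduce the JIT output.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Shape 1: the early return sits lexically inside the loop body. With a
   naive layout the return block lands in the middle of the loop, so the
   hot path has to jump over it on every iteration. */
bool equals_inline_return(const uint16_t *a, const uint16_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (a[i] != b[i])
            return false;
    }
    return true;
}

/* Shape 2: the exit code is written after the loop, so the hot path can
   fall through the comparison and only branches away on a mismatch. */
bool equals_exit_outside(const uint16_t *a, const uint16_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (a[i] != b[i])
            goto not_equal;
    }
    return true;

not_equal:
    return false;
}
```

Both functions return the same result for every input; only the placement of the exit code differs.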
* Span performance improvements
  - Add System.Memory dependency on System.Vectors for !netstandard10 build configurations
  - Vectorized SequenceEquals for Span<byte> and Span<char>
  - Add workarounds for https://github.com/dotnet/coreclr/issues/9692
Low-tech approach to #9692. Finds forward branches to returns and moves the return block later in the method. Gives better layout in simple cases of search loops that return when they find a result. Won't handle cases where there are multiple non-loop blocks in a loop reachable from just one loop block, or cases where there are non-loop blocks but not returns, say from a `break` or similar construct.
Some perf results from the low-tech dotnet/coreclr#11192:
There is also what looks like a win in some of the Linq
The tree sort case is a good example of why something more general is warranted, as there are three loop exits, two series of non-loop blocks, and two backedges. There's likely an even bigger win if we can generally move all the non-loop code out. And we'd want something that works for breaks from inner loops in nests, not just for returns. I'm going to put this up for consideration in 2.1 since it looks like we really should fix this.
No idea how advanced LSDs are, but are they still able to optimize if the body is not in the loop?
This is a common and measurable optimization on our tightest code. Definitely a win if we can avoid writing so many go-to statements. |
Rearrange basic blocks during loop identification so that loop bodies are kept contiguous when possible. Blocks in the lexical range of the loop which do not participate in the flow cycle (which typically correspond to code associated with early exits using `break` or `return`) are moved out below the loop when possible without breaking EH region nesting. The target insertion point, when possible, is chosen to be the first spot below the loop that will not break up fall-through. Layout can significantly affect the performance of loops, particularly small search loops, by avoiding the taken branch on the hot path, improving the locality of the code fetched while iterating the loop, and potentially aiding loop stream detection. Resolves #9692.
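The rearrangement described above can be pictured as a stable partition over the loop's lexical range. The C sketch below is a deliberately simplified illustration, not the JIT's implementation: `Block`, `in_cycle`, and `compact_loop` are hypothetical names, every block is assumed to end in an explicit jump (so reordering does not change behavior), and the EH-region and fall-through constraints mentioned in the description are ignored.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a basic block. */
typedef struct {
    int  id;
    bool in_cycle;  /* true if the block participates in the loop's flow cycle */
} Block;

/* Reorder blocks[first..last] (the loop's lexical range, inclusive) so that
   all blocks in the cycle come first and the non-cycle blocks follow
   directly below them. Relative order inside each group is preserved.
   Returns the number of blocks that belong to the cycle. */
size_t compact_loop(Block *blocks, size_t first, size_t last)
{
    size_t write = first;  /* next slot for a cycle block */
    for (size_t i = first; i <= last; i++) {
        if (!blocks[i].in_cycle)
            continue;
        /* Shift the non-cycle blocks between write and i down by one
           slot, then place the cycle block at the write position. */
        Block moved = blocks[i];
        for (size_t j = i; j > write; j--)
            blocks[j] = blocks[j - 1];
        blocks[write++] = moved;
    }
    return write - first;
}
```

For a range laid out as header, return block, body, return block, latch, the result is header, body, latch, followed by the two return blocks, so the loop body becomes contiguous.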
this can be removed |
Remove some `goto`s that were added to work around #9692 (poor code layout for loop exit paths) -- the JIT's layout decisions were improved in dotnet#13314, and these particular `goto`s are no longer needed; crossgen of System.Private.CoreLib now produces the same machine code with or without this change. Part of #13466.
Remove some `goto`s that were added to work around dotnet/coreclr#9692 (poor code layout for loop exit paths) -- the JIT's layout decisions were improved in dotnet/coreclr#13314, and these particular `goto`s are no longer needed; the same machine code is generated with or without this change. Some `goto`s previously tagged as workarounds for dotnet/coreclr#9692 are still relevant for keeping codesize down pending dotnet/coreclr#13549; update their comments accordingly. Part of #23395.
Remove some `goto`s that were added to work around undesirable jit layout (#9692, fixed in dotnet#13314) and epilog factoring (improved in dotnet#13792 and dotnet#13903), which are no longer needed. Resolves #13466.
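As a rough illustration of the `goto`s that the commits above describe as still useful for code size, the C sketch below routes several failure exits through one shared return. This is an assumption about the general pattern, written in C rather than C#; `starts_with` and its parameters are hypothetical and are not taken from the changed files.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: several early exits all produce the same result,
   so they jump to one shared exit path instead of each having its own
   return. */
bool starts_with(const uint16_t *s, size_t s_len,
                 const uint16_t *prefix, size_t prefix_len)
{
    if (s == NULL || prefix == NULL)
        goto return_false;
    if (prefix_len > s_len)
        goto return_false;

    for (size_t i = 0; i < prefix_len; i++) {
        if (s[i] != prefix[i])
            goto return_false;
    }
    return true;

return_false:
    /* Single shared exit for every failure case. */
    return false;
}
```

Whether this shape is still worthwhile depends on how well the compiler merges equivalent return sites on its own, which is the point of the follow-up work referenced above.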
Loops with a return statement in the body run slower than those without. It would be good if the JIT had a way to optimize this. If tracking this is too complex, perhaps moving `return [true|false]` statements could be a starting point.
Test code
https://gist.github.com/bbowyersmyth/9514af463745528d8d290e7cd2492660
The very simple loop runs in 85.7ns vs 67.4ns (a 20% difference). The gap can widen as additional instructions are added to the body.
The current theory is that this is due to the CPU's complexity rules for the loop stream detector.
Initial suggestion by @jkotas dotnet/coreclr#2667 (comment)
Recent discussion dotnet/coreclr#9213
cc @mikedn