-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LoopCloneContext::EvaluateConditions need to evaluate for const init, limit condition. #10314
Comments
Note this is related to loop unrolling optimization issue here https://github.com/dotnet/coreclr/issues/11606# |
Hmm, you really picked a funny example to start with. This is one of those where loop cloning kicks in and clones the loop but then other optimizations remove the redundant range check and you're left with 2 identical loop versions.
The limit of the loop is const and the array length is also const but the later might be problematic to figure out at the point where loop cloning runs, before SSA/VN.
It would probably make sense to try to unroll the loop version without range checks. Unrolling the one with range checks will just generate more code for no benefit. |
In fact loop cloning + unrolling could be a pretty interesting optimization when the array length is not known (e.g. function parameter). Code like int total = 0;
for (int i = 0; i < 4; ++i)
total += arr[i];
return total; could generate something like G_M55886_IG01:
4883EC28 sub rsp, 40
G_M55886_IG02:
33C0 xor eax, eax
33D2 xor edx, edx
4885C9 test rcx, rcx
7417 je SHORT G_M55886_IG05
G_M55886_IG03:
83790804 cmp dword ptr [rcx+8], 4
7C11 jl SHORT G_M55886_IG05
G_M55886_IG04:
4203448110 add eax, dword ptr [rcx+16]
4203448110 add eax, dword ptr [rcx+20]
4203448110 add eax, dword ptr [rcx+24]
4203448110 add eax, dword ptr [rcx+28]
G_M55886_IG06:
4883C428 add rsp, 40
C3 ret
G_M55886_IG05:
3B5108 cmp edx, dword ptr [rcx+8]
7314 jae SHORT G_M55886_IG07
4C63C2 movsxd r8, edx
4203448110 add eax, dword ptr [rcx+4*r8+16]
FFC2 inc edx
83FA04 cmp edx, 4
7CEC jl SHORT G_M55886_IG05
EB14 jmp SHORT G_M55886_IG06
G_M55886_IG07:
E8FEC3395F call CORINFO_HELP_RNGCHKFAIL
CC int3
|
Yes, but first. we have to evaluate constant backedge taken count for loop so seems the solution is just move transforming [ADDED]
does this code really needs safe array checks? |
Yes, of course. You have no way of knowing that the array length is >= 4. |
I mean, this code will be failed if the array length is less than 4 whatever its optimized or not. |
It is unsafe in the sense that it may throw an |
I understood, Thanks for explaining. (I didn't know that because I just moved here from LLVM) |
Yes. The only flexibility around array range checks is some reordering (hoisting checks out of loops) and that only in special circumstances, when CompliationRelaxation attributes is used. The JIT does not use that today. Other than that eliminating range checks requires proving that they are redundant. The JIT does that today in a number of places - loop cloning, EarlyProp, AssertionProp and RangeCheck AFAIR. |
Okay thanks, I will take closer to fix it. :> |
Okay.... so now I understood that is really need but um
what you excepted is this right? |
This seems more reasonable code generations. |
In the specific case of loop cloning you know that the loop version that has range checks won't be normally executed, people just don't pass a 2 element array when a 4 element array is expected. And even if they do, this results in throwing an exception and that's far more costly than the loop code. In short - rarely run code + transform that increases code size = not useful. And if we ignore loop cloning for a while there's another issue with unrolling loops that contain range checks. If you can't prove that the range checks are redundant then unrolling generates quite a bit more code than in C/C++. And not only that the code is larger, it also contains more branches. That's going to make it difficult to evaluate the impact unrolling has on overall application performance. A microbenchmark might show that it works well but those branches will usually consume entries from CPU's branch table that's used for branch prediction. That may result in other branches being thrown out, possible having a negative impact on branch prediction in other parts of the code. |
That's a problematic transform to make in the absence of those compilation relaxations option I mentioned above. You have to prove that throwing the exception before the loop does not result in observable side effects being lost. For example, if the loop contains |
without I have no idea how to solve this. Thanks for explaining. |
I don't think this has anything to do with noexcept. The main difference between C/C++ and C# here is the presence of range checks.
As mentioned previously, you could try to unroll the cloned loop that does not contain range checks while leaving the one with range checks alone. Another option is to look at loops that do not use arrays but pointers (unsafe code). That would be more similar to C/C++. At the same time, I have no idea how useful is that - that is, how much real world code would benefit from this. |
Just don't mind :> I was just joking. Just never mind. also okay, I will try it with pointers :> |
That is - use unsafe code :)
|
@ArtBlnd I think you're starting to get a sense of why optimizations in the jit frequently pose challenges you may not see when coming from a C/C++ centric optimizer. Exceptions are a fundamental part of the .Net ecosystem and need careful and constant consideration. There's also this cruel irony: many different operations can cause exceptions, so there is a high density of potential exception sites in the IR. But, at runtime, exceptions almost never happen. This typically leads us to view exceptions as obstacles or barriers that need to be removed or mitigated to allow better optimization of the common case where no exceptions happen But it also means that it is dangerously easy to develop optimizations that invalidly reorder or suppress exceptions and then not realize this until much later when the right case comes along. The second thing I hope you're sensing is that loop optimization needs to be guided by some overall strategy and the various transformations need to work together in a coordinated fashion. We are a long ways from having that in the JIT. You are more then welcome to help us figure out how to make it better. |
Thanks @AndyAyersMS. also it's such pleasure to help me out :> Thanks :> |
@ArtBlnd -- No worries, you are doing great. |
Thanks. um can I ask you that where can I get helped to get used on coreclr's code. for example. I am new, yes. but I don't see any docs like doxygen over here. Thanks. |
A few resources on the JIT: And more broadly (covering the runtime, gc etc): |
I don't see any specific, unique work here, so I'm going to close this. |
current EvaluateConditions should evaluate const-init, const-limit for not cloning very natural loops
this is actually statically known as true. (always taken)
this makes loop to clone(evaluate as not statically known as true or false), which causes not to unroll general natural loops (flagged as
LPFLG_DONT_UNROLL
)actually, this loop dosn't really needs to be cloned. (or maybe needs implements for evaluate loop taken count?)
reason why evaluate conditions needs to evaluate const-init, const-limit because if this code unrolled, the
for
expressions is be always taken. so its statically taken.here is example
but after loop-cloning will not to perform unroll-loop.
this is like ironic, like a deadlock.
I think transforming
for
orforeach
expressions intodo~while
need to be separated method fromoptCloneLoops
, which that doesn't really need to be cloned.category:cq
theme:loop-opt
skill-level:expert
cost:medium
The text was updated successfully, but these errors were encountered: