-
Notifications
You must be signed in to change notification settings - Fork 2.7k
Do not spill eeStack after ldtoken opcode. #10215
Do not spill eeStack after ldtoken opcode. #10215
Conversation
PTAL @dotnet/jit-contrib
Lab tests didn't show any difference in throughput. |
Even linear recursive algorithms would suffer a stack overflow if the recursion got too deep.
I'd guess that's because those lab tests are measuring common-case inputs, and this code is an escape valve for "pathological" worst-case inputs. You could try running this change through SPMI to see if any tests suddenly take forever or fail with stack overflow. You could also hand-craft inputs with arbitrarily high stack depths, but without knowing which algorithms might choke there'd be some art/luck to crafting an input that would trigger problematic recursion.
It needs to be low enough that we won't fail with stack overflow, and that even for "pathological" inputs we don't spend hours trying to jit them. It needs to be high enough that it won't hamper optimizations in, say, 99.999999% of code. So I'd suggest coming at it from the other side -- in addition to seeing how high you can make it without our tests falling over, see how low can you make it without our tests having asm diffs. If we have those upper/lower bounds then we can argue about where in that range we'd guess would most often be in the good range for inputs outside our test suite. |
and/or add some instrumentation to measure how big the trees get, run that on top of your change to remove the limit, and build up some histograms of how frequent each tree size is on various inputs |
Some context might help. If this is related to #9948 then we're not talking about having better or worse CQ but about failing to recognize certain patterns that the JIT is required to recognize in certain cases. |
Yes, it is related to that issue. However, Jit works better with trees in all phases and this inconspicuous transformation might cause worse register allocation at least. |
Yes, that's true but optimizations aren't a must have, you can make a reasonable compromise between tree depth and CQ. But in the case of your issue it's not clear if a compromise is possible and how it may look. Side note: there's a somewhat related issue - #8119. |
Sounds like a worthwhile experiment would be to do this where the hand-crafted input is using the InitializeArray pattern from #9948 -- if such an input blows our stack, then we'll either need a more robust implementation of the transform that CoreRT requires or we'll need to live with a (new, lower) max array rank/size/whatever in CoreRT. |
@JosephTremoulet The pattern in question is this: https://github.com/dotnet/corefx/blob/f38d75c8a10309b34213e5d4a96313f538194392/src/Microsoft.CSharp/src/Microsoft/CSharp/RuntimeBinder/Semantics/PredefinedMembers.cs#L765 It's not about using big arrays, or arrays with too many dimensions. This is how the relevant part of the IL looks like:
The above is the 109-element array allocation and storing the first element of it. The lines after Is the problem that the stack is not empty and RyuJIT can't spill it in a clean way, so it will insert an artificial spill? |
What's going on with
|
Ah, sorry. I didn't realize the first element of the array is so short that the C# compiler initialized it with code - here's the full NGenDump (that is - until we hit the noway assert saying that a The first element of the array that is initialized using the helper is at |
Thanks, I think I'm following now. So we're not trying to pattern-match arbitrarily large IL sequences, we're trying to pattern-match sequences ranging in size from five to sixty-eight IL opcodes. And in the problem case, we just happen to get unlucky in that the "have I read in a lot of IL without hitting an empty evaluation stack?" check hits its limit (of 200 bytes of IL) right in the middle of that 5-instruction sequence (due in this case to an outer array having a long sequence without an empty evaluation stack). It's a little odd to me that we're trying to pattern-match an IL sequence, but we're doing it after the sequence has already been translated to RyuJit IR. It's nice for throughput that this only kicks in when there's actually a call to On the other hand, the code that does the spilling passes the default value of false for
So it seems to me (but someone should verify) that the problem only happens if the "it's been too long since we've had an empty evaluation stack" check kicks in right between the |
How about something low tech like passing a flag to the jit indicating "this method contains array initializers" that would then be used to suppress the too many IL opcodes since last spill logic entirely for certain methods. Or maybe there is some other way to limit the set of methods where they might show up, eg if they only appear in a |
It doesn't look right to create another workaround in addition to the existing one.
|
2fbfc6f
to
8391d4d
Compare
b6d776e
to
5cacd40
Compare
I updated PR according to @JosephTremoulet proposition. As I understand it is not possible to cause stack overflow after this change. Even more we could forbid spill after all leaves opcodes. |
Created PR with deep tree #10835 to fail the initial PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good with some comments.
If you wanted to mitigate against very long sequences of ldtokens, you could add a check that the opcode before ldtoken
is dup
(which would require tracking prevPrevOpcode
or maybe lastDupOffset
[similar to delegateCreateStart
]), but I believe you're right it's not necessary.
// | ||
// Return Value: | ||
// true if it is legal, false if it could be a sequence that we do not want to divide. | ||
bool Compiler::impCanSpillNow(OPCODE prevOpcode) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems odd to shadow the prevOpcode
member with a prevOpcode
parameter. Did you want to make this static? I'd lean toward making it parameterless.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
prevOpcode is not member of compiler, do we want to make it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, sorry, I was just misreading (I thought you were loading a member at the callsite, but I see now you're passing a local), it's good how you have it.
src/jit/importer.cpp
Outdated
bool Compiler::impCanSpillNow(OPCODE prevOpcode) | ||
{ | ||
StackEntry& lastSE = impStackTop(0); | ||
GenTreePtr tree = lastSE.val; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like tree
is unused (and then so is lastSE
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I though that it is possible to restore last opcode from the tree, but I was wrong.
src/jit/importer.cpp
Outdated
GenTreePtr tree = lastSE.val; | ||
if (prevOpcode == CEE_LDTOKEN) | ||
{ | ||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should have a comment referencing/explaining the InitializeArray sequence that we're avoiding breaking up to guarantee that impInitializeArrayIntrinsic
can succeed.
src/jit/importer.cpp
Outdated
{ | ||
return false; | ||
} | ||
return true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: this could be just return prevOpcode != CEE_LDTOKEN;
src/jit/compiler.h
Outdated
@@ -3102,6 +3102,10 @@ class Compiler | |||
|
|||
//---------------- Spilling the importer stack ---------------------------- | |||
|
|||
// The maximum tree size in the importer before it spills all trees from the stack into local variables. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"maximum tree size..." is a bit misleading, as this is actually "maximum number of bytes of IL processed..."
I fixed comments, please look again. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Picks up dotnet/coreclr#10215.
Picks up dotnet/coreclr#10215.
There is old code that spills trees on stack every 200 IL opcodes.
I suppose that it was done because there were non-linear recursive algorithms. It happened many years ago before tf source control and I was not able to find who did it.
Now we should not have them and there is no difference between
10 trees * 100 nodes each or one tree with 1000 nodes in it.
This spilling might create perfomance regressions because some
optimizations work only with tree values.