
Do not spill eeStack after ldtoken opcode. #10215

Merged: 3 commits merged into dotnet:master on Apr 11, 2017

Conversation

@sandreenko commented Mar 16, 2017

There is old code that spills the trees on the evaluation stack every 200 IL opcodes.
I suppose that it was done because there were non-linear recursive algorithms. It happened many years ago, before TFS source control, so I was not able to find who added it.
Now we should not have them, and there is no difference between
10 trees of 100 nodes each and one tree with 1000 nodes.
This spilling might create performance regressions because some
optimizations work only with tree values.
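For reference, the logic being described looks roughly like the sketch below in the importer loop. This is a simplified illustration only; the identifier names (opcodeOffs, lastSpillOffs, impSpillStackEnsure) are assumptions about the surrounding code, not quoted from it.

  // Simplified sketch of the periodic-spill check in the importer loop.
  // After roughly 200 bytes of IL have been imported without the evaluation
  // stack becoming empty, everything still on the stack is spilled to temps,
  // which cuts up the trees that were being built.
  if ((opcodeOffs - lastSpillOffs) > 200) // the magic constant in question
  {
      impSpillStackEnsure(); // spill every evaluation-stack entry to a local
      lastSpillOffs = opcodeOffs;
  }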

@sandreenko added the * NO MERGE * (The PR is not ready for merge yet; see discussion for detailed reasons) and area-CodeGen labels Mar 16, 2017
@sandreenko (Author)

PTAL @dotnet/jit-contrib
I do not propose to merge this code as-is, but I want to discuss it.
It would be nice to at least rewrite it:

  1. move the 200 constant into a named variable;
  2. find the optimal value for this constant;
  3. if the optimal value turns out to be infinite, delete this code.

Lab tests didn't show any difference in throughput.

@JosephTremoulet

I suppose that it was done because there were non-linear recursive algorithms.

Even linear recursive algorithms would suffer a stack overflow if the recursion got too deep.

Lab tests didn't show any difference in throughput

I'd guess that's because those lab tests are measuring common-case inputs, and this code is an escape valve for "pathological" worst-case inputs. You could try running this change through SPMI to see if any tests suddenly take forever or fail with stack overflow. You could also hand-craft inputs with arbitrarily high stack depths, but without knowing which algorithms might choke there'd be some art/luck to crafting an input that would trigger problematic recursion.

find the optimal value for this constant

It needs to be low enough that we won't fail with stack overflow, and that even for "pathological" inputs we don't spend hours trying to jit them. It needs to be high enough that it won't hamper optimizations in, say, 99.999999% of code.

So I'd suggest coming at it from the other side -- in addition to seeing how high you can make it without our tests falling over, see how low you can make it without our tests having asm diffs. If we have those upper/lower bounds, then we can argue about where in that range we'd guess the value would most often be in the good range for inputs outside our test suite.

@JosephTremoulet

see how low you can make it without our tests having asm diffs

and/or add some instrumentation to measure how big the trees get, run that on top of your change to remove the limit, and build up some histograms of how frequent each tree size is on various inputs.
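For illustration, such instrumentation could be as small as the sketch below. This is purely hypothetical: CountNodes is not an existing JIT helper, and the walk only handles simple unary/binary operators, so calls and other special node kinds would need extra handling.

  // Hypothetical helper for the instrumentation idea above: count the nodes
  // in a tree so tree sizes can be histogrammed at the points where the
  // importer would otherwise have forced a spill.
  static unsigned CountNodes(GenTreePtr tree)
  {
      if (tree == nullptr)
      {
          return 0;
      }
      unsigned count = 1; // count this node
      if (tree->OperIsUnary())
      {
          count += CountNodes(tree->gtGetOp1());
      }
      else if (tree->OperIsBinary())
      {
          count += CountNodes(tree->gtGetOp1());
          count += CountNodes(tree->gtGetOp2());
      }
      return count;
  }

Calling that on each stack entry (for example on impStackTop(0).val) at the would-be spill points, both with and without the limit, would give the histograms.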

@mikedn commented Mar 16, 2017

Some context might help. If this is related to #9948 then we're not talking about having better or worse CQ but about failing to recognize certain patterns that the JIT is required to recognize in certain cases.

@sandreenko (Author) commented Mar 16, 2017

Some context might help. If this is related to #9948 then we're not talking about having better or worse CQ but about failing to recognize certain patterns that the JIT is required to recognize in certain cases.

Yes, it is related to that issue. However, the JIT works better with trees in all phases, and this inconspicuous transformation might cause worse register allocation, at the very least.

@mikedn commented Mar 16, 2017

However, the JIT works better with trees in all phases, and this inconspicuous transformation might cause worse register allocation, at the very least.

Yes, that's true, but optimizations aren't a must-have; you can make a reasonable compromise between tree depth and CQ. But in the case of your issue it's not clear whether a compromise is possible and what it would look like.

Side note: there's a somewhat related issue - #8119.

@JosephTremoulet

You could also hand-craft inputs with arbitrarily high stack depths, but without knowing which algorithms might choke there'd be some art/luck to crafting an input that would trigger problematic recursion.

Sounds like a worthwhile experiment would be to do this where the hand-crafted input is using the InitializeArray pattern from #9948 -- if such an input blows our stack, then we'll either need a more robust implementation of the transform that CoreRT requires or we'll need to live with a (new, lower) max array rank/size/whatever in CoreRT.

@MichalStrehovsky (Member)

or we'll need to live with a (new, lower) max array rank/size/whatever in CoreRT.

@JosephTremoulet The pattern in question is this: https://github.com/dotnet/corefx/blob/f38d75c8a10309b34213e5d4a96313f538194392/src/Microsoft.CSharp/src/Microsoft/CSharp/RuntimeBinder/Semantics/PredefinedMembers.cs#L765

It's not about using big arrays, or arrays with too many dimensions. This is what the relevant part of the IL looks like:

  IL_0066:  ldc.i4.s   109
  IL_0068:  newarr     Microsoft.CSharp.RuntimeBinder.Semantics.PredefinedMethodInfo
  IL_006d:  dup
  IL_006e:  ldc.i4.0
  IL_006f:  ldc.i4.0
  IL_0070:  ldc.i4.1
  IL_0071:  ldc.i4.s   124
  IL_0073:  ldc.i4.s   103
  IL_0075:  ldc.i4.0
  IL_0076:  ldc.i4.5
  IL_0077:  ldc.i4.0
  IL_0078:  ldc.i4.2
  IL_0079:  newarr     [System.Runtime]System.Int32
  IL_007e:  dup
  IL_007f:  ldc.i4.0
  IL_0080:  ldc.i4.s   125
  IL_0082:  stelem.i4
  IL_0083:  newobj     instance void Microsoft.CSharp.RuntimeBinder.Semantics.PredefinedMethodInfo::.ctor(valuetype Microsoft.CSharp.RuntimeBinder.Semantics.PREDEFMETH,
                                                                                                          valuetype Microsoft.CSharp.RuntimeBinder.Semantics.MethodRequiredEnum,
                                                                                                          valuetype Microsoft.CSharp.RuntimeBinder.Syntax.PredefinedType,
                                                                                                          valuetype Microsoft.CSharp.RuntimeBinder.Syntax.PredefinedName,
                                                                                                          valuetype Microsoft.CSharp.RuntimeBinder.Semantics.MethodCallingConventionEnum,
                                                                                                          valuetype Microsoft.CSharp.RuntimeBinder.Semantics.ACCESS,
                                                                                                          int32,
                                                                                                          int32[])
  IL_0088:  stelem.ref
  IL_0089:  dup

The above is the allocation of the 109-element array and the store of its first element. The lines after IL_006e are repeated 109 times. Note that the compiler didn't bother to create a temporary for the 109-element array; it keeps it on the stack with dups for the entire time it's populating the 109 elements.

Is the problem that the stack is not empty and RyuJIT can't spill it in a clean way, so it will insert an artificial spill?

@JosephTremoulet

What's going on with InitializeArray?

I'm still not following this well enough to know. Issue #9948 seems to be talking about this code, which describes a different IL sequence than above. The comments around InitializeArray indicate that it initializes an array from bytes in the image's data section. The IL snippet you pasted is using code to initialize the array with a series of stelems... What am I missing? Is there a rewrite step somewhere that removes the stelems and puts the data inline in the image? What comes after all those stelems, is it a call to InitializeArray, and if so, why did it bother initializing the array with all that explicit code?

What's the JIT doing?

The JIT is just trying to make sure it doesn't create an expression tree (a GenTree in its IR) that's too big to process with a recursive algorithm (the primary limitation of that, I'd suspect, being stack overflow).

This code has a modest max stack depth, why does it trip the JIT's "too big tree" sensor?

For expediency's sake, the JIT is checking "have I read in a lot of IL without hitting an empty evaluation stack?" as a proxy for "am I building a very large expression?"

What do we do now?

That's not clear to me. The JIT almost certainly needs a limit on expression size to avoid crashing on "pathological" inputs (without going through and replacing every recursive algorithm with a worklist-based one...). Apparently (though I still don't understand the details of how, as explained in the first section of this comment), that translates to a limit on input sequences that the JIT can handle without demanding an implementation of InitializeArray from CoreRT. If we can find a way to rewrite the check so that it still avoids JIT crashes (maybe by explicitly checking expression depth/size?) but permits all the input patterns that CoreRT will need, cool. If we can find a way to rewrite the transform in the JIT so that it can identify the InitializeArray sequence regardless of whether this throttling code kicked in, great. I can't really picture how to do either of those without better understanding what patterns need to be handled (and what the IR looks like as we read those patterns in -- I'm imagining that we turn each stelem into a statement in the IR, but if instead we're building a giant comma tree or something then that would be trouble).

@MichalStrehovsky (Member)

I'm still not following this well enough to know. Issue #9948 seems to be talking about this code, which describes a different IL sequence than above

Ah, sorry. I didn't realize the first element of the array is so short that the C# compiler initialized it with code - here's the full NGenDump (that is, until we hit the noway assert saying that a mustExpand intrinsic wasn't expanded): ngendump.txt

The first element of the array that is initialized using the helper is at IL_00ba.

@JosephTremoulet

Thanks, I think I'm following now. So we're not trying to pattern-match arbitrarily large IL sequences, we're trying to pattern-match sequences ranging in size from five to sixty-eight IL opcodes. And in the problem case, we just happen to get unlucky in that the "have I read in a lot of IL without hitting an empty evaluation stack?" check hits its limit (of 200 bytes of IL) right in the middle of that 5-instruction sequence (due in this case to an outer array having a long sequence without an empty evaluation stack).

It's a little odd to me that we're trying to pattern-match an IL sequence, but we're doing it after the sequence has already been translated to RyuJit IR. It's nice for throughput that this only kicks in when there's actually a call to InitializeArray, but we open ourselves to issues like this where the IL didn't get translated how we'd expect for whatever reason. Until recently, our processing for dup had some code to look ahead at the next opcode and process some patterns specially; something like that would be more robust.

On the other hand, the code that does the spilling passes the default value of false for spillLeaves, and it looks like the InitializeArray sequence would mostly generate leaves:

  • ldc generates constants which are leaves
  • newarr/newobj generate calls which aren't, but
  • dup would spill the call from newarr/newobj and our InitializeArray pattern-match code is already expecting that
  • aside from creating that spill, dup should just put two leaves on the stack
  • I don't think that dup would redundantly spill if the bail-out happened to hit just before it and spill the call

So it seems to me (but someone should verify) that the problem only happens if the "it's been too long since we've had an empty evaluation stack" check kicks in right between the ldtoken and the call to InitializeArray, yes? So then I'd think that simply suppressing the spill if the last opcode was ldtoken would work.
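Concretely, the suppression being suggested would look something like the sketch below at the spill check. This reuses the illustrative identifiers from the earlier sketch (opcodeOffs, lastSpillOffs, impSpillStackEnsure) and is not the final change, which is reviewed further down.

  // The five-opcode IL sequence that must not be broken up is the one that
  // impInitializeArrayIntrinsic pattern-matches:
  //
  //   ldc.i4   <length>
  //   newarr   <T>
  //   dup
  //   ldtoken  <field holding the initialization data>
  //   call     System.Runtime.CompilerServices.RuntimeHelpers::InitializeArray
  //
  // If the periodic spill fires between the ldtoken and the call, the stack
  // entries that the pattern match needs are replaced by loads of temps and
  // the expansion (which CoreRT requires) fails.  So ask first whether the
  // previous opcode makes this a bad place to spill.
  if ((opcodeOffs - lastSpillOffs) > 200 && impCanSpillNow(prevOpcode))
  {
      impSpillStackEnsure();
      lastSpillOffs = opcodeOffs;
  }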

@AndyAyersMS (Member)

How about something low-tech like passing a flag to the JIT indicating "this method contains array initializers" that would then be used to suppress the "too many IL opcodes since last spill" logic entirely for those methods.

Or maybe there is some other way to limit the set of methods where they might show up; e.g., if they only appear in a .cctor or something similar, the JIT could just key on that to enable spill suppression.

@sandreenko (Author)

It doesn't look right to create another workaround in addition to the existing one.
I am planning to do some experiments with it this week:

  1. cause a stack overflow without spilling;
  2. find non-linear tree-walking algorithms that would prevent us from disabling spilling even if the first case is fixed (for example, by rewriting the tree-walking functions as non-recursive versions).

@sandreenko changed the title from "[WIP] Don't spill values on stack without a reason." to "Add test with deep execution tree without stores." Apr 9, 2017
@sandreenko changed the title from "Add test with deep execution tree without stores." to "Do not spill eeStack after ldtoken opcode." Apr 9, 2017
@sandreenko (Author) commented Apr 9, 2017

I updated the PR according to @JosephTremoulet's proposal. As I understand it, it is not possible to cause a stack overflow after this change. We could go even further and forbid the spill after all leaf opcodes.
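For illustration, that more aggressive version might look like the hypothetical sketch below (not part of this PR); it treats "the previous opcode pushed a leaf" as "the newest stack entry is a leaf".

  // Hypothetical generalization of impCanSpillNow: never force the periodic
  // spill while the newest evaluation-stack entry is still a leaf, since
  // spilling a leaf cannot make any tree smaller; it only replaces the leaf
  // with a load of a temp.
  bool Compiler::impCanSpillNow(OPCODE prevOpcode) // prevOpcode unused here
  {
      GenTreePtr top = impStackTop(0).val;
      return !top->OperIsLeaf();
  }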

@sandreenko removed the * NO MERGE * (The PR is not ready for merge yet; see discussion for detailed reasons) label Apr 10, 2017
@sandreenko (Author)

Created PR #10835 with a deep execution tree to show that the initial version of this PR fails.

@JosephTremoulet left a comment


Looks good with some comments.

If you wanted to mitigate against very long sequences of ldtokens, you could add a check that the opcode before ldtoken is dup (which would require tracking prevPrevOpcode or maybe lastDupOffset [similar to delegateCreateStart]), but I believe you're right it's not necessary.

//
// Return Value:
// true if it is legal, false if it could be a sequence that we do not want to divide.
bool Compiler::impCanSpillNow(OPCODE prevOpcode)


Seems odd to shadow the prevOpcode member with a prevOpcode parameter. Did you want to make this static? I'd lean toward making it parameterless.

@sandreenko (Author)


prevOpcode is not a member of Compiler; do we want to make it one?


No, sorry, I was just misreading (I thought you were loading a member at the callsite, but I see now you're passing a local), it's good how you have it.

bool Compiler::impCanSpillNow(OPCODE prevOpcode)
{
    StackEntry& lastSE = impStackTop(0);
    GenTreePtr  tree   = lastSE.val;


Looks like tree is unused (and then so is lastSE).

@sandreenko (Author)


Yes, I thought that it would be possible to restore the last opcode from the tree, but I was wrong.

    GenTreePtr tree = lastSE.val;
    if (prevOpcode == CEE_LDTOKEN)
    {
        return false;


This should have a comment referencing/explaining the InitializeArray sequence that we're avoiding breaking up to guarantee that impInitializeArrayIntrinsic can succeed.

    {
        return false;
    }
    return true;


Nit: this could be just return prevOpcode != CEE_LDTOKEN;

@@ -3102,6 +3102,10 @@ class Compiler

//---------------- Spilling the importer stack ----------------------------

// The maximum tree size in the importer before it spills all trees from the stack into local variables.


"maximum tree size..." is a bit misleading, as this is actually "maximum number of bytes of IL processed..."

Sergey Andreenko added 2 commits April 10, 2017 10:32
@sandreenko (Author)

I fixed the comments, please take another look.

@JosephTremoulet left a comment


LGTM

@pgavlin left a comment


LGTM

@sandreenko merged commit 5bc1e42 into dotnet:master Apr 11, 2017
@sandreenko deleted the don't-spill-values-in-stack branch April 11, 2017 03:53
MichalStrehovsky added a commit to MichalStrehovsky/corert that referenced this pull request Apr 26, 2017
jkotas pushed a commit to dotnet/corert that referenced this pull request Apr 26, 2017
@karelz modified the milestone: 2.0.0 Aug 28, 2017