-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Display stack trace at stack overflow #31956
Display stack trace at stack overflow #31956
Conversation
This change enables printing stack trace at stack overflow to the console. In case multiple threads fail with stack overflow in parallel, only the trace of the first thread is printed.
cc: @stephentoub, @echesakovMSFT |
Here are examples of stack traces printed:
And here is a second example where each of the three functions has a local of a struct type where the struct is 64kB large. This tests the stack probing that JIT emits for frames larger than certain size.
|
Should we have a cut-off for this? |
I am not sure I understand what you mean. |
@@ -6703,6 +6707,9 @@ bool IsIPInMarkedJitHelper(UINT_PTR uControlPc) | |||
CHECK_RANGE(JIT_WriteBarrier) | |||
CHECK_RANGE(JIT_CheckedWriteBarrier) | |||
CHECK_RANGE(JIT_ByRefWriteBarrier) | |||
#if !defined(TARGET_ARM64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's special about ARM64 that it is not in this list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ARM64 doesn't have the JIT_StackProbe (yet). See #13519.
|
||
static volatile LONG g_stackOverflowCallStackLogged = 0; | ||
|
||
// Dump stack trace only for the first thread failing with stack overflow to prevent mixing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are the other threads going to wait for this one to finish printing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I've verified that. I've tried to put a 10 second sleep after printing the stack trace and it was still printed while the other threads were passing by.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even if the other threads hit stackoverflow exception as well? It may be useful to note in the comment where the other threads are going to wait.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was testing it with many threads hitting stack overflow at the same time. But I've just found that the threads wait only when running under a debugger. When running without a debugger, nothing is waiting for this thread to complete and so we get partial or even no stack trace dump here. So I'll need to fix it.
I mean that getting a log message with 174584 lines is not very useful. It would be nice to log e.g. first 1000 lines and last 1000 lines and skip the 172,000 lines in between with |
Ah ok, I see - you have implemented even something better. |
Will it also work if the loop is not begin from the top of the stack? Stack overflow.
at TestStackOverflow1.Program.InfiniteRecursionE()
at TestStackOverflow1.Program.InfiniteRecursionD()//`B` calls both `D` and then `C`, and overflow finally happen in `C`
Repeat 174584 times:
--------------------------------
at TestStackOverflow1.Program.InfiniteRecursionB()
at TestStackOverflow1.Program.InfiniteRecursionA()
at TestStackOverflow1.Program.InfiniteRecursionC()
--------------------------------
at TestStackOverflow1.Program.InfiniteRecursionB()
at TestStackOverflow1.Program.InfiniteRecursionA()
at TestStackOverflow1.Program.Main(System.String[]) |
Does this only search for repeats starting at the top frame? This strategy could get unlucky and not find any repeats if there is a non-recursive function call at the top. For example: A => B => A => B => ... => F
If I got this right, here's an idea on how to efficiently fix this: Take a histogram of functions involved in the stack. 99% of that histogram will be in the recursive part. That way it should be possible to efficiently determine the repeating section. It would be a heuristic that would essentially always work in practical cases. So first guess the repeating part from the histogram, then validate this assumption. Most of the time the validation succeeds and a shortened stack can be printed. |
@yyjdelete, @GSPP - My primary goal was to get any stack trace at stack overflow since without my change, all you'd get is a "Stack overflow" message. The secondary goal was to add some form of compaction, initially something that would work for cases of an infinite recursion and that would have reasonable performance. I believe this covers majority of the cases of bugs in the code flow.
@GSPP the histogram idea is interesting, thank you for the suggestion. It seems that there could still be cases where the functions on the top of the stack would occur somewhere else in the stack trace and mislead the detection. I guess you could still mitigate that somehow. It seems though that the Tarjan algorithm would solve it reliably for 100% of the cases. |
* Update the stack overflow stack size to use fixed size as a basis rather than basing it on the virtual page size. * Rename the AdjustContextForWriteBarrier to AdjustContextForJITHelpers * Add waiting for the stack overflow stack trace printing completion to the `EEPolicy::HandleFatalStackOverflow` so that it gets printed completely even in case of multiple threads failing with stack overflow.
src/coreclr/src/vm/eepolicy.cpp
Outdated
|
||
// Dump stack trace only for the first thread failing with stack overflow to prevent mixing | ||
// multiple stack traces together. | ||
CLREvent* event = new CLREvent(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can throw exceptions, etc. It may be better to poll with Sleep(50)
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, I've added a new commit replacing the event by polling as you've suggested.
Hooray thank you for doing this @janvorli ! This will make it possible to diagnose these from a log file, which is a huge improvement. |
This reverts commit 65587ba.
Oof, I was looking forward to this. |
This change enables printing stack trace at stack overflow to the
console. In case multiple threads fail with stack overflow in parallel,
only the trace of the first thread is printed.
There are couple of interesting details:
pthread_attr_getstack
returns the address of the guard page in some of the glibc / distros and address of the last accessible page before the guard page on others. So I've relaxed the check to consider +/- 1 page around the stack limit to recognize sigsegv as stack overflow.