How to debug StackOverflowException #9195

Open
Petermarcu opened this issue Oct 27, 2017 · 21 comments
Labels: area-ExceptionHandling-coreclr, question

Comments

@Petermarcu
Member

@Daniel15 commented on Wed Oct 25 2017

I'm getting this error while moving a site from ASP.NET Core 1.1 on Mono to ASP.NET Core 2.0 on .NET Core 2.0:

dbug: Microsoft.AspNetCore.Mvc.Internal.ControllerActionInvoker[2]
      Executed action method Daniel15.Web.Controllers.ShortUrlController.Index (Daniel15.Web), returned result Microsoft.AspNetCore.Mvc.ContentResult.
Process is terminating due to StackOverflowException.
[1]    12976 abort      LD_LIBRARY_PATH=/tmp/ssltest ASPNETCORE_ENVIRONMENT=Development =

How do I get a full stack trace for the StackOverflowException to determine where it's coming from?

@danmoseley
Member

@janvorli your stack overflow work was in 2.0, I think?

@ayende
Contributor

ayende commented Jan 21, 2018

Any news about this? Any idea how to get at least some idea about what is going on?

@janvorli
Member

The only thing that could work is to run the app under lldb and when it hits the stack overflow, load the libsosplugin.so and run "clrstack -f".

@ayende
Contributor

ayende commented Jan 22, 2018

@janvorli Any suggestions for doing this on Windows?
We are trying with procdump right now.
The problem is that this is happening in production, and the kind of things we can do there are limited.

@cdmihai
Contributor

cdmihai commented Jun 1, 2018

Stack Overflow questions suggest either using WinDbg or reproducing it in VS under the debugger. This is hard when the issue is difficult to reproduce and happens in processes spawned by the entry process (or when it's not happening on Windows). Just printing out the stack trace would be so helpful ...

@ayende
Contributor

ayende commented Jun 1, 2018

@cdmihai Presumably at this point it would be hard to print the stack trace (there is no stack with which to work, after all).
But I want to join in and comment that anything would be good here. Having even a small portion of the stack trace should usually be enough to tell us what is recursing and narrow down investigation times considerably.

@patricksuo

> The only thing that could work is to run the app under lldb and when it hits the stack overflow, load the libsosplugin.so and run "clrstack -f".

@janvorli How do Microsoft devs debug this kind of bug in prod?
Not every bug can be reproduced easily in a local environment.

@patricksuo

> Having even a small portion of the stack trace should usually be enough to tell us what is recursing and narrow down investigation times considerably.

This is exactly what Go does. (In the stack trace below, I elided some frames manually.)

supei@sandbox-dev-hk:~$ cat a.go
package main

func foo()() {
	foo()
}

func main(){
	foo()
}

supei@sandbox-dev-hk:~$ go run a.go
runtime: goroutine stack exceeds 1000000000-byte limit
fatal error: stack overflow

runtime stack:
runtime.throw(0x46d1a8, 0xe)
	/home/supei/go/src/runtime/panic.go:608 +0x72
runtime.newstack()
	/home/supei/go/src/runtime/stack.go:1008 +0x729
runtime.morestack()
	/home/supei/go/src/runtime/asm_amd64.s:429 +0x8f

goroutine 1 [running]:
main.foo()
	/home/supei/a.go:3 +0x2e fp=0xc020086378 sp=0xc020086370 pc=0x44e9fe
main.foo()
	/home/supei/a.go:4 +0x20 fp=0xc020086388 sp=0xc020086378 pc=0x44e9f0
main.foo()
	/home/supei/a.go:4 +0x20 fp=0xc020086398 sp=0xc020086388 pc=0x44e9f0
main.foo()
	/home/supei/a.go:4 +0x20 fp=0xc0200863a8 sp=0xc020086398 pc=0x44e9f0
main.foo()
	/home/supei/a.go:4 +0x20 fp=0xc020086998 sp=0xc020086988 pc=0x44e9f0
main.foo()
	/home/supei/a.go:4 +0x20 fp=0xc0200869a8 sp=0xc020086998 pc=0x44e9f0
...additional frames elided...
exit status 2
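
The repeated `main.foo()` frames in a trace like this are what point at the recursing function. As a rough sketch of that idea, the Go program below collapses consecutive identical frame names into a single entry with a count. It assumes the frame names have already been extracted one per string; `collapseFrames` and the `run` type are hypothetical names, not part of any runtime or tool.

```go
package main

import (
	"fmt"
	"strings"
)

// run is one frame name and how many consecutive frames it occupied.
type run struct {
	name  string
	count int
}

// collapseFrames (hypothetical helper) merges consecutive identical frame
// names, so a deep self-recursion shows up as one entry with a large count.
func collapseFrames(frames []string) []run {
	var out []run
	for _, f := range frames {
		f = strings.TrimSpace(f)
		if f == "" {
			continue // ignore blank lines in the input
		}
		if n := len(out); n > 0 && out[n-1].name == f {
			out[n-1].count++
			continue
		}
		out = append(out, run{name: f, count: 1})
	}
	return out
}

func main() {
	frames := []string{"main.foo()", "main.foo()", "main.foo()", "main.main()"}
	for _, r := range collapseFrames(frames) {
		fmt.Printf("%s x%d\n", r.name, r.count)
	}
	// Prints:
	// main.foo() x3
	// main.main() x1
}
```

This only catches direct self-recursion; a cycle such as A, B, A, B would need cycle detection over a window of frames instead.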

@ayende
Contributor

ayende commented Oct 18, 2018

In other words, just as CoreCLR allocates an OutOfMemoryException instance up front, could we preallocate some space (1 KB should be more than enough) and do that there?

@patricksuo

Go has a dynamic (goroutine) stack that lives on the heap. The Go runtime grows and shrinks the stack as needed.
In the stack overflow scenario, the runtime will preempt the goroutine just before it requires abnormal stack growth.

I'm not familiar with .NET. I guess managed code runs on the native thread stack.
Maybe the thread stack guard page mechanism is something that could help.

@janvorli
Member

> I guess managed code runs on the native thread stack.

That's right. We already run sigsegv handler on an alternate stack to be able to at least print the message and not just silently die. This alternate stack is kept as small as possible since we need to allocate it for each thread. That size would likely not be enough to run the code that's necessary to dump the stack trace. But since we've recently switched to allocating the alternate stack space using mmap, we could actually reserve larger VM space and commit just the size needed by the regular sigsegv handling. On stack overflow, we could commit more of the space so that we have enough to dump the stack trace.
I've created #825 assigned to myself to track it.

@markusschaber

markusschaber commented Nov 15, 2018

I currently have a problem where I cannot even get a stack trace with the Visual Studio debugger... So anything that could help us get a clue would be welcome... :-)

[Edit: We solved this problem in the meantime via "print-debugging": we used log entries to narrow down the exact place where the code crashes, so it's not urgent anymore...]

@facundofarias

+1 :|

@BrunoJuchli
Contributor

Does using WinDbg and SOS still work with .NET Core?

As described here: https://stackoverflow.com/a/49882734/684096

@fwanggg

fwanggg commented Feb 28, 2019

> That's right. We already run sigsegv handler on an alternate stack to be able to at least print the message and not just silently die. This alternate stack is kept as small as possible since we need to allocate it for each thread. That size would likely not be enough to run the code that's necessary to dump the stack trace. But since we've recently switched to allocating the alternate stack space using mmap, we could actually reserve larger VM space and commit just the size needed by the regular sigsegv handling. On stack overflow, we could commit more of the space so that we have enough to dump the stack trace.

Where is the stack trace dumped to, standard error or standard output? I am debugging in an orchestrated, containerized environment. When the app crashes because of a StackOverflowException, the container goes away and all that is left is stderr and stdout:
2019-02-28T14:33:34.98-0500 [APP/PROC/WEB/0] ERR Process is terminating due to StackOverflowException.
What's the best way to debug a StackOverflowException in this kind of environment?

@jhudsoncedaron

Wait ... you're already outputting Process is terminating due to a StackOverflowException ... Too bad we can't walk down the frames and output them. This can be done in a constant amount of RAM.
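
To illustrate what walking only the top frames into a fixed-size buffer can look like, here is a hedged Go sketch (Go because it is the language already used earlier in this thread; it says nothing about how the .NET runtime does or could do this). It uses the standard `runtime.Callers` and `runtime.CallersFrames` functions; `topFrames`, `recurse`, and the `maxDepth` threshold are hypothetical names and values chosen for the example. A depth counter stands in for the overflow condition, since a real stack overflow in Go cannot be intercepted by user code.

```go
package main

import (
	"fmt"
	"runtime"
)

// maxDepth is an assumed threshold standing in for "about to overflow".
const maxDepth = 1000

// topFrames records at most len(pcs) frames of the calling goroutine into
// the caller-supplied buffer and returns their function names, innermost first.
func topFrames(pcs []uintptr) []string {
	n := runtime.Callers(2, pcs) // skip runtime.Callers and topFrames itself
	if n == 0 {
		return nil
	}
	frames := runtime.CallersFrames(pcs[:n])
	names := make([]string, 0, n)
	for {
		f, more := frames.Next()
		names = append(names, f.Function)
		if !more {
			break
		}
	}
	return names
}

// recurse calls itself until the depth guard trips, then captures only the
// innermost eight frames instead of the whole stack.
func recurse(depth int) []string {
	if depth >= maxDepth {
		var pcs [8]uintptr // fixed-size buffer: the capture cost is bounded
		return topFrames(pcs[:])
	}
	return recurse(depth + 1)
}

func main() {
	for _, name := range recurse(0) {
		fmt.Println(name)
	}
}
```

The number of frames captured is bounded by the buffer, not by the recursion depth, which is the property the comment above is asking for.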

@TehWardy

TehWardy commented Jul 9, 2019

Got this from the console ...

Api> Route matched with {action = "Get", controller = "App"}. Executing controller action with signature Microsoft.AspNetCore.Mvc.IActionResult Get(Microsoft.AspNet.OData.Query.ODataQueryOptions`1[Core.Objects.Entities.CMS.App]) on controller Api.Controllers.AppController (Api).
Api>
Api> Process is terminating due to StackOverflowException.

Put a breakpoint in the action ... it's not getting that far ... so how do I debug stack overflows in DI?

@daiplusplus

daiplusplus commented May 26, 2021

I'd like to add that when running ASP.NET Core in an Azure App Service it's even more painful because the EventLog.xml file that Azure App Services maintains for you doesn't record any mention of the process being killed due to a stack-overflow. That's maddening. This means that every unexpected stack-overflow causes 2-3 hours of figuring out "why isn't the website working?" because there's no indication the entire process is crashing in the first place.

It seems in Azure the only solution is to enable short-term crash monitoring, then reproduce the issue (assuming you can even consistently and reliably reproduce it in the first place!), then download the multi-gigabyte-sized .dmp file that Azure Portal saves to your blob storage account, and then wait over 30 minutes for Visual Studio to chew through the .dmp file (all while VS shows an ugly pop-up informing me that a background process is "taking too long" and only giving me a (very tempting) "Terminate" button).

So I'd describe the issue more broadly as: the overall developer UX for diagnosing and investigating stack-overflow crashes in .NET Core is abysmal and this is especially disappointing given Microsoft has a generally good reputation for developer-tooling - and we never had this problem in .NET Framework 1.x, where we could at least catch( StackOverflowException ).


Out of curiosity (and I know it's off-topic), but why doesn't EventLog.xml record app-crashes due to stack-overflows?

@danmoseley
Member

@tommcdon do you know who writes this xml file? The work @janvorli did to emit the stack was a game changer but it sounds like the scenario doesn't quite work E2E here.

@tommcdon
Member

Eventlog.xml is part of the Application Event Log feature in Azure App Services. I'll find out the owners and try out the E2E scenario with StackOverFlow. It sounds like we might have a scenario gap here.

@jhudsoncedaron

Or you can fix #8947 to at least allow catch (StackOverflowException) to work. The original reason for denying it is long gone.

A stack overflow has been theoretically recoverable forever. Once you have rolled back 4 KB of stack, you can call the native function _resetstkoflw: https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/resetstkoflw?view=msvc-160
