Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider not using STACK_SIZE_PARAM_IS_A_RESERVATION anymore #45971

Open
jhudsoncedaron opened this issue Dec 11, 2020 · 3 comments
Open

Consider not using STACK_SIZE_PARAM_IS_A_RESERVATION anymore #45971

jhudsoncedaron opened this issue Dec 11, 2020 · 3 comments

Comments

@jhudsoncedaron
Copy link

jhudsoncedaron commented Dec 11, 2020

Description

We've been dealing with issues semi-periodically where someone will run the application server out of memory and hang or crash the application in interesting ways. In response (in order to ensure we can get a trace on any such crash) we've been systematically eliminating any ways memory allocation failures can bring the application down as these were either server hangs or meaningless crashes because some backgrounded task had no catchall at top level.

I think we've finally located the last one.

New managed threads are created in https://github.com/dotnet/runtime/blob/69e114c1abf91241a0eeecf1ecceab4711b8aa62/src/coreclr/vm/threads.cpp
or possibly https://github.com/dotnet/runtime/blob/69e114c1abf91241a0eeecf1ecceab4711b8aa62/src/coreclr/vm/threads.cpp both of which use STACK_SIZE_PARAM_IS_A_RESERVATION .

The consequence of STACK_SIZE_PARAM_IS_A_RESERVATION is out of memory can in rare cases trigger an essentially randomly located stack overflow rather than a more reasonable OutOfMemoryException. As I said, we've been systematically tracking down and ensuring that OutOfMemoryException is handled. But we can't handle stack overflow exceptions from random places in the code.

Finally got sent this beauty:

Faulting application name: Cedaron.xxxxxxxxxxxxxx.exe, version: 10.0.1405.10004, time stamp: 0x5e597f86
Faulting module name: KERNELBASE.dll, version: 6.3.9600.19671, time stamp: 0x5e673d0b
Exception code: 0xc00000fd
Fault offset: 0x000000000000b3f7
Faulting process id: 0xde0
Faulting application start time: 0x01d6cda83754f5b2
Faulting application path: C:\Program Files\Cedaron xxxxxxxxxxxxxxxxxxx.exe
Faulting module path: C:\Windows\system32\KERNELBASE.dll
Report Id: 7d55039e-3bda-11eb-8106-005056bf56da
Faulting package full name:
Faulting package-relative application ID:

The application is 100% managed code, and it died with a native stack overflow. I would expect that stack memory is already allocated at thread creation, but it just isn't leading to the crashes being untraceable and unactionable. We can't even distinguish legit stackoverflows due to bugs in recursive functions from someone didn't give it enough memory again.

For the obvious reason, I can't include a small repro. Do you really want a memory bomb demo?

We are currently running net core 3.1, but the issue has existed for quite some time (potentially always existed) and still exists in trunk.

The only workaround I can imagine is hooking CreateThread and squashing the flag. Would be easier to build the runtime locally and remove the flag from the call sites.

Tangentially related to #8947 but fixing that will no longer help us.

@danmoseley
Copy link
Member

@janvorli

@janvorli
Copy link
Member

It seems that in majority of cases, the STACK_SIZE_PARAM_IS_A_RESERVATION is a good thing, as most threads would use just a small portion of the reserved stack space. If we stopped passing it, applications using e.g. many threadpool threads would crash with OOM even though they would never need to use that amount of memory.
So I believe removing the STACK_SIZE_PARAM_IS_A_RESERVATION by default would hurt more than help.
Moreover, on Unix, we don't have a way to control that, stack allocation is always a reservation (unless we would go and touch every stack page at the thread creation time).
However, maybe we could add a new configuration setting that would cause the STACK_SIZE_PARAM_IS_A_RESERVATION to not to be passed in for the special cases like yours.

@jhudsoncedaron
Copy link
Author

@janvorli Unix systems raise bus error instead of stack overflow when the stack memory can't be allocated. This makes the cases distinguishable. But if I turned off overcommit I would observe the memory being reserved immediately on clone() (not just the address space--this reserves space in the swap file, thus guaranteeing the demand can be met). But I'm pretty sure can't create thread in the threadpool is tolerated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants