Core dump of StackOverflowException does not provide any information about the source of the exception #12752

Closed
andrii-litvinov opened this issue May 25, 2019 · 9 comments

@andrii-litvinov

We are moving to a containerized environment hosted with Docker Swarm, and I want to understand how to track down a StackOverflowException if one occurs.

I created a simple application and built it in a container in Debug mode to avoid tail-call optimization and force a StackOverflowException.

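using System.Threading.Tasks;

public class Program
{
    // Do() calls itself with no base case, so the stack eventually overflows
    private static async Task Main() => Do();
    static void Do() => Do();
}

FROM microsoft/dotnet:2.2-sdk AS build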
WORKDIR /app
COPY Demo/Project.csproj Demo/
RUN dotnet restore ./Demo/Project.csproj
RUN dotnet publish Demo --configuration Debug --output ../out

FROM microsoft/dotnet:2.2-aspnetcore-runtime AS runtime
WORKDIR /app
COPY Demo/entrypoint.sh ./
RUN chmod +x ./entrypoint.sh
ENTRYPOINT ["./entrypoint.sh", "dotnet", "Demo.dll"]

The entry point script sets the core file size ulimit to unlimited to enable core dump creation, because Docker Swarm does not support passing ulimits in:

#!/bin/bash

ulimit -c unlimited

# print configuration values required for core dump creation
echo limits: $(ulimit -a)
echo kernel.core_pattern: $(cat /proc/sys/kernel/core_pattern)

# run the wrapped command; quoting keeps arguments with spaces intact
"$@"
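
Whether a core file actually shows up inside the container depends on the kernel.core_pattern value the script prints. As a rough sketch following the rules in the core(5) manual page (the function name describe_core_pattern is made up for illustration), the value can be classified like this: a leading `|` means the kernel pipes the dump to a helper program, a leading `/` is an absolute path template, and anything else is resolved relative to the crashing process's working directory.

```shell
#!/bin/bash
# Hypothetical helper: classify a kernel.core_pattern value per core(5).
describe_core_pattern() {
    local pattern="$1"
    case "$pattern" in
        '|'*) echo "piped to handler: ${pattern#'|'}" ;;
        /*)   echo "absolute path template: $pattern" ;;
        *)    echo "relative to crashing process cwd: $pattern" ;;
    esac
}

# Usage inside the container:
#   describe_core_pattern "$(cat /proc/sys/kernel/core_pattern)"
```

In the piped case the handler is started by the kernel rather than by the container, so the dump may not end up in the container's filesystem at all.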

I build the container and try to analyze the dump:

λ docker build . -t stackoverflow --pull
λ docker run --name stackoverflow stackoverflow
λ docker commit stackoverflow stackoverflow1
λ docker run -it --rm --entrypoint /bin/bash stackoverflow1

root@10a256a59228:/app# apt update && apt install -y lldb-3.9
root@10a256a59228:/app# lldb-3.9 -O "settings set target.exec-search-paths /usr/share/dotnet/shared/Microsoft.NETCore.App/2.2.5/" \
 -O "plugin load /usr/share/dotnet/shared/Microsoft.NETCore.App/2.2.5/libsosplugin.so" \
 --core ./core \
 $(which dotnet)

(lldb) clrthread
ThreadCount:      3
UnstartedThread:  0
BackgroundThread: 2
PendingThread:    0
DeadThread:       0
Hosted Runtime:   no
                                                                                                        Lock
       ID OSID ThreadOBJ           State GC Mode     GC Alloc Context                  Domain           Count Apt Exception
   1    1    8 0000000000879AF0    20020 Cooperative 00007F46988A3CD8:00007F46988A3FD0 00000000008782F0 0     Ukn
   9    2   12 0000000000824790    21220 Preemptive  0000000000000000:0000000000000000 00000000008782F0 0     Ukn (Finalizer)
  10    3   13 00007F457C0009F0  1020220 Preemptive  0000000000000000:0000000000000000 00000000008782F0 0     Ukn (Threadpool Worker)

(lldb) clrstack
OS Thread Id: 0x8 (1)
        Child SP               IP Call Site
00007FFC9D78F448 00007f4838419fff [GCFrame: 00007ffc9d78f448]
00007FFC9D78F850 00007f4838419fff [GCFrame: 00007ffc9d78f850]

(lldb) clrstack -f
OS Thread Id: 0x8 (1)
        Child SP               IP Call Site
00007F483965C798 00007F4838419FFF libc.so.6!gsignal + 207
00007F483965C830 00007F483841B42A libc.so.6!abort + 362
00007F483965C960 00007F4837B55A23 libcoreclr.so + -1
00007F483965C970 00007F4837B1E5E2 libcoreclr.so + -1
00007F483965C9C0 00007F48390340E0 libpthread.so.0!__restore_rt
00007FFC9CF92000 00007F4837833834 libcoreclr.so!AllocateString_MP_FastPortable(unsigned int) + 4
00007FFC9CF92010 00007F47BE799BF2
00007FFC9CF92050 00007F47BE79991B
00007FFC9CF92160 00007F47BE799808
00007FFC9CF92180 00007F47BE79977A
00007FFC9CF921C0 00007F47BE79979E
00007FFC9CF92200 00007F47BE79979E
00007FFC9CF92240 00007F47BE79979E
00007FFC9CF92280 00007F47BE79979E
00007FFC9CF922C0 00007F47BE79979E
00007FFC9CF92300 00007F47BE79979E
00007FFC9CF92340 00007F47BE79979E
00007FFC9CF92380 00007F47BE79979E
00007FFC9CF923C0 00007F47BE79979E
...
00007FFC9CFA1940 00007F47BE79979E
00007FFC9D78F448                  [GCFrame: 00007ffc9d78f448]
00007FFC9D78F850                  [GCFrame: 00007ffc9d78f850]
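
The lldb invocation above hardcodes the runtime directory 2.2.5. A minimal sketch for locating the directory instead, assuming libsosplugin.so ships inside the shared runtime directory as it does in this image (the function name find_sos_dir is hypothetical, and picking the newest version is an assumption; with several runtimes installed, the directory should match the runtime that produced the dump):

```shell
#!/bin/bash
# Hypothetical helper: print the directory of the newest libsosplugin.so
# found under a .NET Core shared-runtime root (default taken from the post).
find_sos_dir() {
    local root="${1:-/usr/share/dotnet/shared/Microsoft.NETCore.App}"
    local plugin
    # sort -V (GNU coreutils) orders 2.2.10 after 2.2.5
    plugin=$(find "$root" -maxdepth 2 -name libsosplugin.so 2>/dev/null \
        | sort -V | tail -n 1)
    # Fail (non-zero status) when no plugin was found
    [ -n "$plugin" ] && dirname "$plugin"
}

# Usage:
#   sos_dir=$(find_sos_dir) && lldb-3.9 \
#     -O "settings set target.exec-search-paths $sos_dir" \
#     -O "plugin load $sos_dir/libsosplugin.so" \
#     --core ./core "$(which dotnet)"
```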

Neither clrstack nor clrstack -f shows useful information about the source of the StackOverflowException, although the repeated lines with 00007F47BE79979E are obviously due to the StackOverflowException. I am new to memory dump debugging, especially with lldb inside a Docker container. Am I missing something? I have also tried to collect a minidump by setting

ENV COMPlus_DbgEnableMiniDump=1 \
COMPlus_DbgMiniDumpType=4

to see if it differs, even though it is not possible to run a container with --cap-add=SYS_PTRACE in Swarm. It shows the same result, and the dump itself was about 12 GB, which is very large for a minidump.

On a side note, I tried to get more details about a regular exception from core dumps created on crash, and I could not find any symbols. It shows !Unknown instead of the symbol name:

(lldb) pe
Exception object: 00007f773c017ef0
Exception type:   <Unknown>
Message:          Bang.
InnerException:   <none>
StackTrace (generated):
    SP               IP               Function
    00007FFF12DE84F0 00007F7869D49666 Demo.dll!Unknown+0x86
    00007FFF12DE85A0 00007F7869D4CB3F System.Private.CoreLib.dll!Unknown+0x1f
    00007FFF12DE85B0 00007F7869D4C2F7 System.Private.CoreLib.dll!Unknown+0x57
    00007FFF12DE85D0 00007F7869D48CD4 Demo.dll!Unknown+0x54

versus the full minidump:

(lldb) pe
Exception object: 00007f773c017ef0
Exception type:   System.Exception
Message:          Bang.
InnerException:   <none>
StackTrace (generated):
    SP               IP               Function
    00007FFF12DE84F0 00007F7869D49666 Demo.dll!Demo.Program+<Main>d__0.MoveNext()+0x86
    00007FFF12DE85A0 00007F7869D4CB3F System.Private.CoreLib.dll!System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()+0x1f
    00007FFF12DE85B0 00007F7869D4C2F7 System.Private.CoreLib.dll!System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task)+0x57
    00007FFF12DE85D0 00007F7869D48CD4 Demo.dll!Demo.Program.<Main>()+0x54

Is there any way to view symbols for regular core dumps? There is currently no way to use minidumps in Docker Swarm because it does not support --cap-add=SYS_PTRACE, and they are also very large.

@jkotas
Member

jkotas commented May 25, 2019

cc @mikem8361

@mikem8361 mikem8361 self-assigned this May 25, 2019
@mikem8361
Member

  1. The size of the "full" core dump (COMPlus_DbgMiniDumpType=4) has been fixed in the latest coreclr (will be in Preview 6 of 3.0), but you should NOT need that to get good stack trace information.
  2. The !UNKNOWN problems with minidumps have been fixed in the "new" SOS out of the diagnostics repo (see https://github.com/dotnet/diagnostics#installing-sos on how to install it). It will work with your 2.2.x version of the runtime (this SOS is no longer tied to the runtime version).
  3. This new version of SOS will also automatically download the symbols (both native and managed) for the core dump image. See the new setsymbolserver command. The SOS installer will configure SOS to automatically load in lldb and enable this symbol download support.
  4. Yes, you will need --cap-add=SYS_PTRACE in a docker container no matter how you generate the core dumps (via COMPlus_DbgEnableMiniDump=1 or setting up Linux to generate a system core dump).
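
For a plain docker run outside Swarm, points 1 and 4 combine roughly as follows. This is only a sketch: the function name minidump_run_command is hypothetical, the image name stackoverflow comes from the original post, and the function prints the command rather than running it. The dump type is optional because, per point 1, the "full" type should not be needed for stack traces.

```shell
#!/bin/bash
# Sketch: print (not run) a `docker run` command that lets the runtime write
# its own dump on a crash. The optional second argument is a
# COMPlus_DbgMiniDumpType value; the thread uses 4 for a "full" dump.
minidump_run_command() {
    local image="$1" dump_type="${2:-}"
    local cmd="docker run --cap-add=SYS_PTRACE -e COMPlus_DbgEnableMiniDump=1"
    if [ -n "$dump_type" ]; then
        cmd="$cmd -e COMPlus_DbgMiniDumpType=$dump_type"
    fi
    echo "$cmd $image"
}

minidump_run_command stackoverflow
```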

@mikem8361
Member

I can repro the problem that clrstack doesn't display any method info about the managed frames. I'll continue to investigate this problem.

frame #0: 0x00007fc1d6e99b80 0x00007fc1d6a6af7b libpthread.so.0`__waitpid + 107
    frame #1: 0x00007fc1d6e99bb0 0x00007fc1d53ae9ee libcoreclr.so`::PROCCreateCrashDumpIfEnabled() + 110 at process.cpp:3242
    frame #2: 0x00007fc1d6e99be0 0x00007fc1d53aea1e libcoreclr.so`::PROCAbort() + 14 at process.cpp:3268
    frame #3: 0x00007fc1d6e99bf0 0x00007fc1d53775e2 libcoreclr.so`sigsegv_handler(code=<unavailable>, siginfo=<unavailable>, context=<unavailable>) + 338 at signal.cpp:379
    frame #4: 0x00007fc1d6e99c40 0x00007fc1d6a6b390 libpthread.so.0`___lldb_unnamed_symbol1$$libpthread.so.0 + 1
    frame #5: 0x00007fff57303000 0x00007fc15bf71d68
    frame #6: 0x00007fff57303010 0x00007fc15bf71d6d
    frame #7: 0x00007fff57303020 0x00007fc15bf71d6d
    frame #8: 0x00007fff57303030 0x00007fc15bf71d6d
    frame #9: 0x00007fff57303040 0x00007fc15bf71d6d
(lldb) ip2md 0x00007fc15bf71d68
MethodDesc:   00007fc15b2e5820
Method Name:          stackover.Program.Do()
Class:                00007fc15c041088
MethodTable:          00007fc15b2e5848
mdToken:              0000000006000002
Module:               00007fc15b2e4418
IsJitted:             yes
Current CodeAddr:     00007fc15bf71d50
Code Version History:
  CodeAddr:           00007fc15bf71d50  (Non-Tiered)
  NativeCodeVersion:  0000000000000000
Source file:  /home/mikem/builds/stackover/Program.cs @ 11
(lldb) ip2md 0x00007fc15bf71d6d
MethodDesc:   00007fc15b2e5820
Method Name:          stackover.Program.Do()
Class:                00007fc15c041088
MethodTable:          00007fc15b2e5848
mdToken:              0000000006000002
Module:               00007fc15b2e4418
IsJitted:             yes
Current CodeAddr:     00007fc15bf71d50
Code Version History:
  CodeAddr:           00007fc15bf71d50  (Non-Tiered)
  NativeCodeVersion:  0000000000000000
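
Resolving the repeated address with ip2md, as in the session above, can be partly automated when the stack is long. A minimal sketch, assuming the output of clrstack -f has been saved to a text file in the three-column layout shown in the original post (the function name most_repeated_ip is hypothetical):

```shell
#!/bin/bash
# Hypothetical helper: read saved `clrstack -f` output on stdin and print the
# instruction pointer that occurs most often, followed by its count.
most_repeated_ip() {
    awk '
        # Count rows whose first two fields are 16-digit hex values
        # (Child SP and IP); headers and IP-less frame rows are skipped.
        length($1) == 16 && $1 ~ /^[0-9A-Fa-f]+$/ &&
        length($2) == 16 && $2 ~ /^[0-9A-Fa-f]+$/ { count[toupper($2)]++ }
        END {
            best = ""; max = 0
            for (ip in count) if (count[ip] > max) { max = count[ip]; best = ip }
            if (best != "") print best, max
        }'
}

# Usage:
#   most_repeated_ip < clrstack-f.txt
# then, in lldb:
#   ip2md 0x<printed address>
```

The address with the highest count is a candidate for the recursing method, not proof of it; ip2md still has to confirm which method it belongs to.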

@mikem8361
Member

The managed stack unwinder (in the DAC) is having problems with the stack overflow probably because we are on an alternate stack when the process aborts on an overflow.

@mikem8361
Member

See issue: dotnet/diagnostics#66

@andrii-litvinov
Author

@mikem8361 thank you for the detailed explanation!

I do the following in a container started from the committed image that contains the dump.

Install dotnet-sdk:

apt update && apt install -y wget gpg; \
    wget -qO- https://packages.microsoft.com/keys/microsoft.asc | gpg --dearmor > microsoft.asc.gpg; \
    mv microsoft.asc.gpg /etc/apt/trusted.gpg.d/; \
    wget -q https://packages.microsoft.com/config/debian/9/prod.list; \
    mv prod.list /etc/apt/sources.list.d/microsoft-prod.list; \
    chown root:root /etc/apt/trusted.gpg.d/microsoft.asc.gpg; \
    chown root:root /etc/apt/sources.list.d/microsoft-prod.list

apt-get install -y apt-transport-https; \
    apt-get update; \
    apt-get install -y dotnet-sdk-2.2

Install dotnet-sos:

dotnet tool install -g dotnet-sos --version 1.0.3-preview5.19251.2

And then I get the following when I try to install SOS:

dotnet sos install
No executable found matching command "dotnet-sos"

So instead I have to run

~/.dotnet/tools/dotnet-sos install

Then, no matter whether I use a Linux core dump (with or without --cap-add=SYS_PTRACE) or a minidump with only COMPlus_DbgEnableMiniDump=1 set, I see no symbols in lldb:

root@3a286c492701:/app# lldb-3.9 $(which dotnet) --core ./core
(lldb) target create "/usr/bin/dotnet" --core "./core"
Core file '/app/./core' (x86_64) was loaded.
(lldb) clrstack
OS Thread Id: 0x8 (1)
        Child SP               IP Call Site
00007FFF188799E0 00007f91c509efff [HelperMethodFrame: 00007fff188799e0]
00007FFF18879B60 00007F914B42D0AF /usr/share/dotnet/shared/Microsoft.NETCore.App/2.2.5/System.Private.CoreLib.dll!Unknown
00007FFF18879B70 00007F914B42CC77 /usr/share/dotnet/shared/Microsoft.NETCore.App/2.2.5/System.Private.CoreLib.dll!Unknown
00007FFF18879B90 00007F914B4290A3 /app/Demo.dll!Unknown
00007FFF18879E58 00007f91c454f17f [GCFrame: 00007fff18879e58]
00007FFF1887A260 00007f91c454f17f [GCFrame: 00007fff1887a260]
(lldb) exit

root@3a286c492701:/app# lldb-3.9 $(which dotnet) --core /tmp/coredump.8
(lldb) target create "/usr/bin/dotnet" --core "/tmp/coredump.8"
Core file '/tmp/coredump.8' (x86_64) was loaded.
(lldb) setsymbolserver -ms
Added Microsoft public symbol server
(lldb) clrstack
OS Thread Id: 0x8 (1)
        Child SP               IP Call Site
00007FFF188799E0 00007f91c5cb8b5a [HelperMethodFrame: 00007fff188799e0]
00007FFF18879B60 00007F914B42D0AF /usr/share/dotnet/shared/Microsoft.NETCore.App/2.2.5/System.Private.CoreLib.dll!Unknown
00007FFF18879B70 00007F914B42CC77 /usr/share/dotnet/shared/Microsoft.NETCore.App/2.2.5/System.Private.CoreLib.dll!Unknown
00007FFF18879B90 00007F914B4290A3 /app/Demo.dll!Unknown
00007FFF18879E58 00007f91c454f17f [GCFrame: 00007fff18879e58]
00007FFF1887A260 00007f91c454f17f [GCFrame: 00007fff1887a260]
(lldb) clrthreads
ThreadCount:      3
UnstartedThread:  0
BackgroundThread: 2
PendingThread:    0
DeadThread:       0
Hosted Runtime:   no
                                                                                                        Lock
 DBG   ID OSID ThreadOBJ           State GC Mode     GC Alloc Context                  Domain           Count Apt Exception
   1    1    8 0000000001A81820    20020 Preemptive  00007F90240459D0:00007F9024045B30 0000000001ACF4F0 0     Ukn <Invalid Object> (00007f902402dde8)
   9    2   12 0000000001AD1D00    21220 Preemptive  0000000000000000:0000000000000000 0000000001ACF4F0 0     Ukn (Finalizer)
  10    3   13 00007F8F0C0009F0  1020220 Preemptive  0000000000000000:0000000000000000 0000000001ACF4F0 0     Ukn (Threadpool Worker)
(lldb) setsymbolserver
Server: http://msdl.microsoft.com/download/symbols/

@mikem8361
Member

I forgot that the "missing metadata" problem that causes the !UNKNOWN in clrstack, etc. wasn't fixed in the preview5 SOS. If you install version 1.0.4-preview6.19272.1 of dotnet-sos and re-install, it will fix this problem. This version, or one really close to it, will be the "preview6" version of SOS and our global tools.

dotnet tool install -g dotnet-sos --version 1.0.4-preview6.19272.1 --add-source https://dotnetfeed.blob.core.windows.net/dotnet-core/index.json

And yes, you will either need to add "$HOME/.dotnet/tools" to your PATH or execute the tool the way you did.
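
One way to do the PATH part, sketched for bash (the helper name add_to_path is made up; it appends the directory only when it is not already listed, so it is safe to run from a shell profile more than once):

```shell
#!/bin/bash
# Append a directory to PATH only if it is not already present.
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;            # already on PATH, nothing to do
        *) PATH="$PATH:$1" ;;
    esac
}

add_to_path "$HOME/.dotnet/tools"
export PATH
```

After this, dotnet-sos install should resolve without spelling out the full ~/.dotnet/tools/ path.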

@andrii-litvinov
Author

I played with it more, and everything works well with dotnet-sos 1.0.4-preview6 and minidumps created with the --cap-add=SYS_PTRACE option and COMPlus_DbgEnableMiniDump=1 set. Thank you for the guidance!

However, I still cannot see symbols in core dumps created by the Linux kernel. Will native core dumps ever be supported by SOS? Or will it only work with minidumps created by the CLR? If not, then I guess that's another reason to use k8s over Swarm.

Do I understand correctly that the StackOverflowException issue will be fixed as part of dotnet/diagnostics#66?

@mikem8361
Member

mikem8361 commented Jun 4, 2019 via email

@tommcdon tommcdon closed this as completed Jun 5, 2019
@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the 3.0 milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 13, 2020