Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missing call site when getting stack trace on stack overflow on Linux #35391

Closed
k15tfu opened this issue Apr 24, 2020 · 9 comments
Closed

Missing call site when getting stack trace on stack overflow on Linux #35391

k15tfu opened this issue Apr 24, 2020 · 9 comments
Labels
area-Diagnostics-coreclr question Answer questions and provide assistance, not an issue with source code or documentation. tracking This issue is tracking the completion of other related issues.
Milestone

Comments

@k15tfu
Copy link
Contributor

k15tfu commented Apr 24, 2020

Hi!

Not sure why I can get it to work because similar issues were already considered a few times (#825, #9195, #12752, dotnet/diagnostics#135).

my Program.cs:

using System;
using System.Diagnostics;
using System.Threading;

namespace stackoverflow_csharp
{
    class Program
    {
        private const int MAX_RECURSIVE_CALLS = 1000000;
        static int ctr = 0;

        //
        public static void Main()
        {
            Console.WriteLine("PID: {0}", Process.GetCurrentProcess().Id);
            Console.WriteLine("Sleeping 5 seconds...");
            Thread.Sleep(5000);

            Program ex = new Program();
            try
            {
                ex.Execute();
            }
            catch (StackOverflowException e)
            {
                Console.WriteLine("\nstofl: {0}", ctr);
                throw;
            }

            Console.WriteLine("\nThe call counter: {0}", ctr);
        }

        private unsafe void Execute(string arg1 = "")
        {
            long* bar = stackalloc long [1000];
            ctr++;
            if (ctr % 50 == 0)
                Console.WriteLine("Call number {0} to the Execute method", ctr);

            if (ctr <= MAX_RECURSIVE_CALLS)
                Execute(string.Format("{0}", (IntPtr)bar));

            ctr--;
        }
    }
}

I attach to this program during Thread.Sleep() and when lldb-8 breaks on SIGSEGV I'm able to get the full stack but it doesn't work when stopped in sigsegv_handler / PROCAbort:

$ lldb-8 -p 2702
(lldb) process attach --pid 2702
Process 2702 stopped
* thread #1, name = 'dotnet', stop reason = signal SIGSTOP
    frame #0: 0x00007fa3881c6ed9 libpthread.so.0`__pthread_cond_timedwait at futex-internal.h:142
  thread #2, name = 'dotnet-ust', stop reason = signal SIGSTOP
    frame #0: 0x00007fa3875a4839 libc.so.6`syscall at syscall.S:38
  thread #3, name = 'dotnet-ust', stop reason = signal SIGSTOP
    frame #0: 0x00007fa3875a4839 libc.so.6`syscall at syscall.S:38
  thread #4, name = 'dotnet', stop reason = signal SIGSTOP
    frame #0: 0x00007fa38759dbf9 libc.so.6`__GI___poll(fds=0x00007fa384b07db8, nfds=1, timeout=-1) at poll.c:29
  thread #5, name = 'dotnet', stop reason = signal SIGSTOP
    frame #0: 0x00007fa3881cadae libpthread.so.0`__libc_open64(file="/tmp/clr-debug-pipe-2702-17520040-in", oflag=0) at open64.c:47
  thread #6, name = 'dotnet', stop reason = signal SIGSTOP
    frame #0: 0x00007fa3881c69f3 libpthread.so.0`__pthread_cond_wait at futex-internal.h:88
  thread #7, name = 'dotnet', stop reason = signal SIGSTOP
    frame #0: 0x00007fa3881c6ed9 libpthread.so.0`__pthread_cond_timedwait at futex-internal.h:142
  thread #8, name = 'dotnet', stop reason = signal SIGSTOP
    frame #0: 0x00007fa3881ca6d7 libpthread.so.0`__libc_accept(fd=11, addr=__SOCKADDR_ARG @ rbp, len=0x00007fa2e7ffcda4) at accept.c:26
  thread #9, name = 'dotnet', stop reason = signal SIGSTOP
    frame #0: 0x00007fa3881c69f3 libpthread.so.0`__pthread_cond_wait at futex-internal.h:88
  thread #10, name = 'dotnet', stop reason = signal SIGSTOP
    frame #0: 0x00007fa3881c6ed9 libpthread.so.0`__pthread_cond_timedwait at futex-internal.h:142

Executable module set to "/usr/share/dotnet/dotnet".
Architecture set to: x86_64-pc-linux-gnu.
(lldb) plugin load /home/user/.dotnet/sos/libsosplugin.so
(lldb) b sigsegv_handler
Breakpoint 1: 2 locations.
(lldb) b PROCAbort
Breakpoint 2: 2 locations.
(lldb) c
Process 2702 resuming
Process 2702 stopped
* thread #1, name = 'dotnet', stop reason = signal SIGSEGV: invalid address (fault address: 0x7fffab49aff8)
    frame #0: 0x00007fa30d6a8b48
->  0x7fa30d6a8b48: pushq  $0x0
    0x7fa30d6a8b4a: pushq  $0x0
    0x7fa30d6a8b4c: decq   %rdi
    0x7fa30d6a8b4f: jne    0x7fa30d6a8b48
(lldb) bt
* thread #1, name = 'dotnet', stop reason = signal SIGSEGV: invalid address (fault address: 0x7fffab49aff8)
  * frame #0: 0x00007fa30d6a8b48
    frame #1: 0x00007fa30d6a8cbf
    frame #2: 0x00007fa30d6a8cbf
[...]
    frame #1033: 0x00007fa30d6a179e
    frame #1034: 0x00007fa386a98d1f libcoreclr.so`CallDescrWorkerInternal + 124
    frame #1035: 0x00007fa3869c90b5 libcoreclr.so`CallDescrWorkerWithHandler(CallDescrData*, int) + 117
    frame #1036: 0x00007fa3869c9b90 libcoreclr.so`MethodDescCallSite::CallTargetWorker(unsigned long const*, unsigned long*, int) + 1264
    frame #1037: 0x00007fa386aad692 libcoreclr.so`RunMain(MethodDesc*, short, int*, PtrArray**) + 786
    frame #1038: 0x00007fa386aad990 libcoreclr.so`Assembly::ExecuteMainMethod(PtrArray**, int) + 272
    frame #1039: 0x00007fa386915f11 libcoreclr.so`CorHost2::ExecuteAssembly(unsigned int, char16_t const*, int, char16_t const**, unsigned int*) + 577
    frame #1040: 0x00007fa3868eed24 libcoreclr.so`coreclr_execute_assembly + 212
    frame #1041: 0x00007fa386ff22f3 libhostpolicy.so`___lldb_unnamed_symbol111$$libhostpolicy.so + 755
    frame #1042: 0x00007fa386ff26b4 libhostpolicy.so`___lldb_unnamed_symbol112$$libhostpolicy.so + 68
    frame #1043: 0x00007fa386ff2d8c libhostpolicy.so`corehost_main + 204
    frame #1044: 0x00007fa387257ecc libhostfxr.so`___lldb_unnamed_symbol386$$libhostfxr.so + 1532
    frame #1045: 0x00007fa387256a63 libhostfxr.so`___lldb_unnamed_symbol384$$libhostfxr.so + 691
    frame #1046: 0x00007fa387251da2 libhostfxr.so`hostfxr_main + 146
    frame #1047: 0x000000000040fd9a dotnet`run(int, char const**) + 618
    frame #1048: 0x000000000040feee dotnet`main + 158
    frame #1049: 0x00007fa3874aab97 libc.so.6`__libc_start_main(main=(dotnet`main), argc=3, argv=0x00007fffabc98b18, init=<unavailable>, fini=<unavailable>, rtld_fini=<unavailable>, stack_end=0x00007fffabc98b08) at libc-start.c:310
    frame #1050: 0x000000000040dcda dotnet`_start + 42
(lldb) clrstack
OS Thread Id: 0xa8e (1)
        Child SP               IP Call Site
00007FFFAB49B000 00007FA30D6A8B48 stackoverflow_csharp.Program.Execute(System.String) [/home/user/p/test/stackoverflow_csharp/Program.cs @ 35]
00007FFFAB49BEB0 00007FA30D6A8CBF stackoverflow_csharp.Program.Execute(System.String) [/home/user/p/test/stackoverflow_csharp/Program.cs @ 41]
00007FFFAB49DE60 00007FA30D6A8CBF stackoverflow_csharp.Program.Execute(System.String) [/home/user/p/test/stackoverflow_csharp/Program.cs @ 41]
00007FFFAB49FE10 00007FA30D6A8CBF stackoverflow_csharp.Program.Execute(System.String) [/home/user/p/test/stackoverflow_csharp/Program.cs @ 41]
00007FFFAB4A1DC0 00007FA30D6A8CBF stackoverflow_csharp.Program.Execute(System.String) [/home/user/p/test/stackoverflow_csharp/Program.cs @ 41]
00007FFFAB4A3D70 00007FA30D6A8CBF stackoverflow_csharp.Program.Execute(System.String) [/home/user/p/test/stackoverflow_csharp/Program.cs @ 41]
00007FFFAB4A5D20 00007FA30D6A8CBF stackoverflow_csharp.Program.Execute(System.String) [/home/user/p/test/stackoverflow_csharp/Program.cs @ 41]
00007FFFAB4A7CD0 00007FA30D6A8CBF stackoverflow_csharp.Program.Execute(System.String) [/home/user/p/test/stackoverflow_csharp/Program.cs @ 41]
00007FFFAB4A9C80 00007FA30D6A8CBF stackoverflow_csharp.Program.Execute(System.String) [/home/user/p/test/stackoverflow_csharp/Program.cs @ 41]
00007FFFAB4ABC30 00007FA30D6A8CBF stackoverflow_csharp.Program.Execute(System.String) [/home/user/p/test/stackoverflow_csharp/Program.cs @ 41]
00007FFFAB4ADBE0 00007FA30D6A8CBF stackoverflow_csharp.Program.Execute(System.String) [/home/user/p/test/stackoverflow_csharp/Program.cs @ 41]
[...]
00007FFFABC93CD0 00007FA30D6A8CBF stackoverflow_csharp.Program.Execute(System.String) [/home/user/p/test/stackoverflow_csharp/Program.cs @ 41]
00007FFFABC95C80 00007FA30D6A8CBF stackoverflow_csharp.Program.Execute(System.String) [/home/user/p/test/stackoverflow_csharp/Program.cs @ 41]
00007FFFABC97C30 00007FA30D6A179E stackoverflow_csharp.Program.Main() [/home/user/p/test/stackoverflow_csharp/Program.cs @ 22]
00007FFFABC97F88 00007fa386a98d1f [GCFrame: 00007fffabc97f88]
00007FFFABC98470 00007fa386a98d1f [GCFrame: 00007fffabc98470]
(lldb) c
Process 2702 resuming
Process 2702 stopped
* thread #1, name = 'dotnet', stop reason = breakpoint 1.1
    frame #0: 0x00007fa386c98550 libcoreclr.so`sigsegv_handler(int, siginfo_t*, void*)
libcoreclr.so`sigsegv_handler:
->  0x7fa386c98550 <+0>: pushq  %rbp
    0x7fa386c98551 <+1>: movq   %rsp, %rbp
    0x7fa386c98554 <+4>: pushq  %r15
    0x7fa386c98556 <+6>: pushq  %r14
(lldb) bt
* thread #1, name = 'dotnet', stop reason = breakpoint 1.1
  * frame #0: 0x00007fa386c98550 libcoreclr.so`sigsegv_handler(int, siginfo_t*, void*)
    frame #1: 0x00007fa3881cb890 libpthread.so.0`___lldb_unnamed_symbol1$$libpthread.so.0 + 1
    frame #2: 0x00007fa30d6a8b48
    frame #3: 0x00007fa30d6a8cbf
    frame #4: 0x00007fa30d6a8cbf
[...]
    frame #1051: 0x00007fa3874aab97 libc.so.6`__libc_start_main(main=(dotnet`main), argc=3, argv=0x00007fffabc98b18, init=<unavailable>, fini=<unavailable>, rtld_fini=<unavailable>, stack_end=0x00007fffabc98b08) at libc-start.c:310
    frame #1052: 0x000000000040dcda dotnet`_start + 42
(lldb) clrstack
OS Thread Id: 0xa8e (1)
        Child SP               IP Call Site
00007FFFABC97F88 00007fa386c98550 [GCFrame: 00007fffabc97f88]
00007FFFABC98470 00007fa386c98550 [GCFrame: 00007fffabc98470]
(lldb) detach
Process 2702 detached
(lldb) q

Consequently, I see the same thing when analyzing auto-generated minidump by createdump:

$ export COMPlus_DbgEnableMiniDump=1
$ dotnet exec /home/user/p/test/stackoverflow_csharp/bin/Debug/netcoreapp3.1/stackoverflow_csharp.dll
PID: 2702
Sleeping 5 seconds...
Call number 50 to the Execute method
Call number 100 to the Execute method
[...]
Call number 950 to the Execute method
Call number 1000 to the Execute method
Stack overflow.
Writing minidump with heap to file /tmp/coredump.2702
Written 86708224 bytes (21169 pages) to core file
Aborted (core dumped)

I have tried dotnet-dump analyze and lldb-8 -c:

$ dotnet-dump analyze /tmp/coredump.2702
Loading core dump: /tmp/coredump.2702 ...
Ready to process analysis commands. Type 'help' to list available commands or 'help [command]' to get detailed help on a command.
Type 'quit' or 'exit' to exit the session.
> clrstack
OS Thread Id: 0xa8e (0)
        Child SP               IP Call Site
00007FFFABC97F88 00007fa3881cb23a [GCFrame: 00007fffabc97f88]
00007FFFABC98470 00007fa3881cb23a [GCFrame: 00007fffabc98470]
> q
~$ lldb-8 `which dotnet` -c /tmp/coredump.2702
(lldb) target create "/usr/bin/dotnet" --core "/tmp/coredump.2702"
Core file '/tmp/coredump.2702' (x86_64) was loaded.
(lldb) plugin load /home/user/.dotnet/sos/libsosplugin.so
(lldb) bt
* thread #1, name = 'dotnet', stop reason = signal SIGABRT
  * frame #0: 0x00007fa3881cb23a libpthread.so.0`__waitpid(pid=2854, stat_loc=0x00007fa3887ffa2c, options=0) at waitpid.c:30
    frame #1: 0x00007fa386ccae9d libcoreclr.so`PROCCreateCrashDump(char**) + 125
    frame #2: 0x00007fa386cc9bfe libcoreclr.so`PROCAbort + 14
    frame #3: 0x00007fa386c986eb libcoreclr.so`sigsegv_handler(int, siginfo_t*, void*) + 411
    frame #4: 0x00007fa3881cb890 libpthread.so.0`___lldb_unnamed_symbol1$$libpthread.so.0 + 1
    frame #5: 0x00007fa30d6a8b48
    frame #6: 0x00007fa30d6a8cbf
    frame #7: 0x00007fa30d6a8cbf
[...]
    frame #1038: 0x00007fa30d6a179e
    frame #1039: 0x00007fa386a98d1f libcoreclr.so`CallDescrWorkerInternal + 124
    frame #1040: 0x00007fa3869c90b5 libcoreclr.so`CallDescrWorkerWithHandler(CallDescrData*, int) + 117
    frame #1041: 0x00007fa3869c9b90 libcoreclr.so`MethodDescCallSite::CallTargetWorker(unsigned long const*, unsigned long*, int) + 1264
    frame #1042: 0x00007fa386aad692 libcoreclr.so`RunMain(MethodDesc*, short, int*, PtrArray**) + 786
    frame #1043: 0x00007fa386aad990 libcoreclr.so`Assembly::ExecuteMainMethod(PtrArray**, int) + 272
    frame #1044: 0x00007fa386915f11 libcoreclr.so`CorHost2::ExecuteAssembly(unsigned int, char16_t const*, int, char16_t const**, unsigned int*) + 577
    frame #1045: 0x00007fa3868eed24 libcoreclr.so`coreclr_execute_assembly + 212
    frame #1046: 0x00007fa386ff22f3 libhostpolicy.so`___lldb_unnamed_symbol111$$libhostpolicy.so + 755
    frame #1047: 0x00007fa386ff26b4 libhostpolicy.so`___lldb_unnamed_symbol112$$libhostpolicy.so + 68
    frame #1048: 0x00007fa386ff2d8c libhostpolicy.so`corehost_main + 204
    frame #1049: 0x00007fa387257ecc libhostfxr.so`___lldb_unnamed_symbol386$$libhostfxr.so + 1532
    frame #1050: 0x00007fa387256a63 libhostfxr.so`___lldb_unnamed_symbol384$$libhostfxr.so + 691
    frame #1051: 0x00007fa387251da2 libhostfxr.so`hostfxr_main + 146
    frame #1052: 0x000000000040fd9a dotnet`run(int, char const**) + 618
    frame #1053: 0x000000000040feee dotnet`main + 158
    frame #1054: 0x00007fa3874aab97 libc.so.6`__libc_start_main(main=(dotnet`main), argc=3, argv=0x00007fffabc98b18, init=<unavailable>, fini=<unavailable>, rtld_fini=<unavailable>, stack_end=0x00007fffabc98b08) at libc-start.c:310
    frame #1055: 0x000000000040dcda dotnet`_start + 42
(lldb) clrstack
OS Thread Id: 0xa8e (1)
        Child SP               IP Call Site
00007FFFABC97F88 00007fa3881cb23a [GCFrame: 00007fffabc97f88]
00007FFFABC98470 00007fa3881cb23a [GCFrame: 00007fffabc98470]
(lldb) clrstack -f
OS Thread Id: 0xa8e (1)
        Child SP               IP Call Site
00007FA3887FF9F0 00007FA3881CB23A libpthread.so.0!__waitpid + 74 at /build/glibc-OTsEL5/glibc-2.27/sysdeps/unix/sysv/linux/waitpid.c:30
00007FA3887FFA20 00007FA386CCAE9D libcoreclr.so!PROCCreateCrashDump(char**) + 125
00007FA3887FFA60 00007FA386CC9BFE libcoreclr.so!PROCAbort + 14
00007FA3887FFA70 00007FA386C986EB libcoreclr.so!sigsegv_handler(int, siginfo_t*, void*) + 411
00007FA3887FFAC0 00007FA3881CB890 libpthread.so.0!__restore_rt
00007FFFAB49B000 00007FA30D6A8B48
00007FFFAB49BEB0 00007FA30D6A8CBF
00007FFFAB49DE60 00007FA30D6A8CBF
00007FFFAB49FE10 00007FA30D6A8CBF
[...]
00007FFFABC4A860 00007FA30D6A8CBF
00007FFFABC4C810 00007FA30D6A8CBF
00007FFFABC97F88                  [GCFrame: 00007fffabc97f88]
00007FFFABC98470                  [GCFrame: 00007fffabc98470]

dotnet sdk v3.1.100
dotnet-sos v3.1.120604

cc @mikem8361

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-Diagnostics-coreclr untriaged New issue has not been triaged by the area owner labels Apr 24, 2020
@ghost
Copy link

ghost commented Apr 24, 2020

Tagging subscribers to this area: @tommcdon
Notify danmosemsft if you want to be subscribed.

@janvorli
Copy link
Member

The reason for this is that we abort due to stack overflow right in the sigsegv_handler. That means that no explicit frame for the failure is created on the stack and attached to the current native runtime thread object (as this is in PAL, we don't have access to that and we would not be allowed to access thread local stuff anyways at that stage (async signal handlers cannot do that unless they know that the location where the thread was interrupted is safe for doing that).
Stack walker that debugger uses can only start at an explicit frame that would point it to the first managed frame and so it cannot see the managed frames.
I have updated stack overflow handling in #32167. With that change, you should be able to see the call stack at stack overflow in lldb.

@k15tfu
Copy link
Contributor Author

k15tfu commented Apr 24, 2020

@janvorli Yep, I saw this issue too, thanks for adding this =)

But it's still unclear why it was okay a year ago (#12752) and as an option we could use minidumps by createdump on stack overflow. Maybe you know?

@janvorli
Copy link
Member

@k15tfu it really depends on where the stack overflow happens. If it happens in the native runtime code (like in the case you've referred to), you'd get the call stack in most cases, because when managed code calls native runtime methods, for many of those we create a HelperMethodFrame. But if it happens in the managed code, you are out of luck.

@tommcdon tommcdon added question Answer questions and provide assistance, not an issue with source code or documentation. and removed untriaged New issue has not been triaged by the area owner labels Apr 28, 2020
@tommcdon tommcdon added this to the 5.0 milestone Apr 28, 2020
@tommcdon
Copy link
Member

@mikem8361

@janvorli
Copy link
Member

Actually, I wonder - if the Thread::StackWalkFramesEx is passed an explicit context in managed code in the pRD, it should be able to walk the stack. I wonder if in this case we are passing it zeroed out pRD somehow.

@tommcdon tommcdon added the tracking This issue is tracking the completion of other related issues. label Jul 6, 2020
@tommcdon tommcdon modified the milestones: 5.0.0, 6.0.0 Jul 8, 2020
@k15tfu
Copy link
Contributor Author

k15tfu commented Jul 17, 2020

@mikem8361 Any thoughts why this happens?

@mikem8361
Copy link
Member

I haven't had a chance to investigate.

@mikem8361
Copy link
Member

I can repro this problem on the 3.1.5 runtime, but not on the latest 5.0 from master so this issue is fixed in most stack overflow cases.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 9, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-Diagnostics-coreclr question Answer questions and provide assistance, not an issue with source code or documentation. tracking This issue is tracking the completion of other related issues.
Projects
None yet
Development

No branches or pull requests

5 participants