Need a better user experience for crash dumps occurring in PRs #31820

Closed
steveharter opened this issue Feb 5, 2020 · 11 comments

Comments

@steveharter
Member

Currently if there is a crash in a PR there is no easy way to diagnose since:

  • The symbols are not attached to the PR and not available publicly.
  • There are no instructions on how to obtain the symbols (or runtime files).
  • Instructions on how to debug using lldb did not work. This may be an issue with the SOS plugin on OSX.

Background: As part of #2259 there was a StackOverflowException on OSX during PR runs. Since the code was new, the crash only occurred in PR runs, and runtime symbols for PR runs are not public.

Steps taken:

  1. From the PR runs, I was able to determine that a crash was occurring on OSX in System.Text.Json.Tests due to a StackOverflowException.

If I had been able to see the verbose console output of the tests (which displays the tests currently running) or the current state of testsresults.xml, I would have been able to debug the test that was causing the issue and wouldn't have needed to go through the additional steps below.

Optimally, I would see the failed test and the managed+native callstacks for the crash.

  1. The test was only crashing on OSX, so from my MacBook I attempted to repro the environment locally (build release CLR and debug version of tests). However, I was not able to reproduce the StackOverflow.

  2. On my MacBook, I downloaded the core dump from the PR test run attachments and, after some searching, found helpful instructions at https://github.com/dotnet/diagnostics/blob/master/documentation/debugging-coredump.md

  3. On my MacBook I installed SOS and dotnet-symbol according to the instructions.

  4. The instructions do not explain how to get the symbols for PR runs. After asking for help, I was able to download the runtime files and associated symbol files through some low-level web requests. Ideally these would have been attached to the PR, like the core dump was.

  5. The instructions state that dotnet-symbol --host-only will not work with local symbols and that the symbols should be copied to a temp directory, so I did that (I actually copied all runtime files to the temp location).

  6. Ran lldb --core /tmp/dump/core.123 /tmp/dump/dotnet. The instructions state "<host-program>" for the last parameter, so I used /tmp/dump/dotnet (also tried libnethost.dylib).

  7. From lldb ran setsymbolserver -directory /tmp/dump

  8. Finally tried to see the stack. Ran sos ClrStack (and other sos commands later) and got an exception (from SOS SymbolReader.LoadNativeSymbols):

Unhandled Exception: System.NullReferenceException: Object reference not set to an instance of an object.
   at System.Diagnostics.TraceInternal.get_AppName()
   at System.Diagnostics.TraceInternal.TraceEvent(TraceEventType eventType, Int32 id, String format, Object[] args)
   at SOS.Tracer.Verbose(String format, Object[] arguments)
   at SOS.SymbolReader.LoadNativeSymbols(SymbolFileCallback callback, IntPtr parameter, String moduleFilePath, UInt64 address, Int32 size, ReadMemoryDelegate readMemory)
0  lldb                     0x00000001038c4705 llvm::sys::PrintStackTrace(llvm::raw_ostream&) + 37
1  lldb                     0x00000001038c3d77 llvm::sys::RunSignalHandlers() + 39
2  lldb                     0x00000001038c4d58 SignalHandler(int) + 264
3  libsystem_platform.dylib 0x00007fff6b70542d _sigtramp + 29
4  libsystem_platform.dylib 0x0000000000000001 _sigtramp + 2492443633
5  libsystem_c.dylib        0x00007fff6b5daa1c abort + 120
6  libcoreclr.dylib         0x00000001051f3a8e PROCAbort + 14
7  libcoreclr.dylib         0x00000001051f2662 PROCEndProcess(void*, unsigned int, int) + 226
8  libcoreclr.dylib         0x00000001054be541 UnwindManagedExceptionPass1(PAL_SEHException&, _CONTEXT*) + 737
9  libcoreclr.dylib         0x00000001054be6e0 DispatchManagedException(PAL_SEHException&, bool) + 304
10 libcoreclr.dylib         0x00000001054b89cd HandleHardwareException(PAL_SEHException*) + 669
11 libcoreclr.dylib         0x00000001051bacc1 SEHProcessException(PAL_SEHException*) + 353
12 libcoreclr.dylib         0x00000001051f7a15 PAL_DispatchException + 181
13 libcoreclr.dylib         0x00000001051f75b7 PAL_DispatchExceptionWrapper + 10
14 libcoreclr.dylib         0x000000011439e226 PAL_DispatchExceptionWrapper + 253389945
15 libcoreclr.dylib         0x000000011439f7c1 PAL_DispatchExceptionWrapper + 253395476
16 libcoreclr.dylib         0x0000000114252ff6 PAL_DispatchExceptionWrapper + 252033609
17 libcoreclr.dylib         0x0000000114247f22 PAL_DispatchExceptionWrapper + 251988341
18 libcoreclr.dylib         0x000000011424722a PAL_DispatchExceptionWrapper + 251985021
19 libcoreclr.dylib         0x0000000105546c6b UMThunkStub + 273
20 libsosplugin.dylib       0x0000000103971dfb LLDBServices::LoadNativeSymbols(lldb::SBTarget, lldb::SBModule, void (*)(void*, char const*, unsigned long, int)) + 539
21 libsosplugin.dylib       0x0000000103972007 LLDBServices::LoadNativeSymbols(bool, void (*)(void*, char const*, unsigned long, int)) + 375
22 libsosplugin.dylib       0x00000001039720fd non-virtual thunk to LLDBServices::LoadNativeSymbols(bool, void (*)(void*, char const*, unsigned long, int)) + 13
23 libsos.dylib             0x0000000104fead18 SetSymbolServer + 696
24 libsosplugin.dylib       0x000000010396db1f sosCommand::DoExecute(lldb::SBDebugger, char**, lldb::SBCommandReturnObject&) + 463
25 LLDB                     0x00000001061ef7bf CommandPluginInterfaceImplementation::DoExecute(lldb_private::Args&, lldb_private::CommandReturnObject&) + 207
26 LLDB                     0x000000010646f582 lldb_private::CommandObjectParsed::Execute(char const*, lldb_private::CommandReturnObject&) + 418
27 LLDB                     0x0000000106466d25 lldb_private::CommandInterpreter::HandleCommand(char const*, lldb_private::LazyBool, lldb_private::CommandReturnObject&, lldb_private::ExecutionContext*, bool, bool) + 2805
28 LLDB                     0x000000010646ad61 lldb_private::CommandInterpreter::IOHandlerInputComplete(lldb_private::IOHandler&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >&) + 657
29 LLDB                     0x00000001063a8f9d lldb_private::IOHandlerEditline::Run() + 285
30 LLDB                     0x00000001063903cb lldb_private::Debugger::ExecuteIOHandlers() + 123
31 LLDB                     0x000000010646b89c lldb_private::CommandInterpreter::RunCommandInterpreter(bool, bool, lldb_private::CommandInterpreterRunOptions&) + 156
32 LLDB                     0x0000000106214b91 lldb::SBDebugger::RunCommandInterpreter(bool, bool) + 209
33 lldb                     0x00000001038af665 Driver::MainLoop() + 2853
34 lldb                     0x00000001038b05d2 main + 1634
35 libdyld.dylib            0x00007fff6b50c7fd start + 1
36 libdyld.dylib            0x0000000000000004 start + 2494511112
Stack dump:
0. Program arguments: /Library/Developer/CommandLineTools/usr/bin/lldb --core /tmp/dump/core.82685 /tmp/dump/dotnet 
Abort trap: 6
  1. After being unable to use lldb to get a call stack, I tried using the runtime downloaded from Helix against my local tests, and I was able to repro the exception and debug the tests.
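
The lldb steps above can be collected into a single non-interactive invocation. The following is a minimal sketch, not a verified fix for the SOS crash shown above: the helper name `build_lldb_command` is hypothetical, it assumes SOS is already loaded by lldb (for example through the installed `.lldbinit`), and it uses lldb's `-o` option to queue the same `setsymbolserver -directory` and `sos ClrStack` commands that were typed by hand.

```python
import shlex


def build_lldb_command(core_path, host_path, symbol_dir, sos_commands=("sos ClrStack",)):
    """Return the argv list for an lldb session over a core dump.

    Hypothetical helper: it only assembles the command line described in the
    steps above and does not launch lldb itself.
    """
    argv = ["lldb", "--core", core_path]
    # Point SOS at the local directory holding the runtime files and symbols.
    argv += ["-o", f"setsymbolserver -directory {symbol_dir}"]
    # Queue each SOS command so it runs once the core has been loaded.
    for command in sos_commands:
        argv += ["-o", command]
    # The host program (for example the dotnet executable) goes last.
    argv.append(host_path)
    return argv


if __name__ == "__main__":
    argv = build_lldb_command("/tmp/dump/core.123", "/tmp/dump/dotnet", "/tmp/dump")
    # Print a copy-pasteable shell command; pass argv to subprocess.run to execute it.
    print(shlex.join(argv))
```

Keeping the command as a list avoids quoting mistakes when the paths contain spaces, and the printed line can be attached to a PR comment so others can rerun the same session.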
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Feb 5, 2020
@danmoseley
Member

cc @wfurt

@wfurt
Member

wfurt commented Feb 5, 2020

Getting the core is not that hard (even if not obvious). Getting matching test bits is more difficult.

@ViktorHofer
Member

I was under the assumption that a crash dump would be sufficient to diagnose such issues. I'm unsure what the action item here is. @steveharter can you please clarify?

@steveharter
Member Author

I was under the assumption that a crash dump would be sufficient to diagnose such issues. I'm unsure what the action item here is. @steveharter can you please clarify?

As @wfurt noted the test bits are necessary. It is not possible to debug a crash dump with lldb without matching symbols.

The description lists 3 issues: no symbols, no instructions on how to get the symbols, and lldb instructions\SOS plugin not working. The latter may not be an "infrastructure" issue but someone should vet the OSX developer experience for dumps caused in a PR as the lldb instructions\SOS didn't work for me.

@wfurt
Member

wfurt commented Feb 6, 2020

A core may be sufficient if we published symbols to the symbol server, as we do for official builds. I don't think we do for PRs. Otherwise, you need the bits and have to set setclrpath, @ViktorHofer. @mikem8361 updated the FAQ: https://github.com/dotnet/diagnostics/blob/master/documentation/FAQ.md#frequently-asked-questions

@steveharter
Member Author

Ideally, at least for the StackOverflow scenario I had, the test run information would include:

  1. The test(s) that were running at the time of the crash. i.e. if we did verbose logging to the console instead of xml file, or save the xml after each test start\end somehow, then someone could inspect that to see which tests started but didn't end.
  2. The full exception information -- i.e. the exception.StackTrace (does this not work with StackOverflowException?).
  3. The managed and native call stack from lldb\sos.

In rare cases where this information isn't enough to troubleshoot, having access to the symbols and\or runtime would be nice for debugging. In my case, a local build did not help, I assume because different optimizations or local settings didn't trigger the StackOverflow.
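
The first item, inspecting a verbose log to see which tests started but didn't end, could look roughly like the sketch below. The `[start] Name` and `[end] Name` line format is a made-up assumption for illustration; the actual console output of the test runner may look different, and `unfinished_tests` is a hypothetical helper.

```python
import re

# Assumed (hypothetical) log format: "[start] TestName" when a test begins
# and "[end] TestName" when it finishes. Other lines are ignored.
EVENT = re.compile(r"^\[(start|end)\]\s+(\S+)\s*$")


def unfinished_tests(log_lines):
    """Return the tests that logged a start but no matching end, in start order."""
    running = {}  # dict keeps insertion order, so results follow start order
    for line in log_lines:
        match = EVENT.match(line)
        if not match:
            continue
        kind, name = match.groups()
        if kind == "start":
            running[name] = True
        else:
            running.pop(name, None)
    return list(running)


if __name__ == "__main__":
    sample = [
        "[start] Tests.ReadSimpleObject",
        "[end] Tests.ReadSimpleObject",
        "[start] Tests.DeeplyNestedObject",
        # The process crashed here, so no "[end]" line was written.
    ]
    print(unfinished_tests(sample))
```

This only works if each line is flushed to the console as the test starts, which is the point of logging to the console rather than to an xml file that is written at the end of the run.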

@ViktorHofer
Member

The test(s) that were running at the time of the crash. i.e. if we did verbose logging to the console instead of xml file, or save the xml after each test start\end somehow, then someone could inspect that to see which tests started but didn't end.

That's something that we would like to do when we switch to dotnet test (VSTest platform):

--Blame|/Blame: Runs the tests in blame mode. This option is helpful in isolating the problematic tests causing test host to crash. It creates an output file in the current directory as Sequence.xml that captures the order of tests execution before the crash.

from https://docs.microsoft.com/en-us/dotnet/core/tools/dotnet-vstest?tabs=netcore21
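
As a rough illustration of how a Sequence.xml file could be used, the sketch below reads the recorded order and reports the last test that ran. The XML layout (a `TestSequence` root with `Test` elements carrying a `Name` attribute) is an assumption made for this example, not something confirmed in this thread, and `last_test_before_crash` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def last_test_before_crash(sequence_xml):
    """Return the name of the last test listed in the sequence, or None if empty.

    Assumes (unverified) that each executed test appears as a <Test Name="..."/>
    element in execution order.
    """
    root = ET.fromstring(sequence_xml)
    names = [test.get("Name") for test in root.iter("Test")]
    return names[-1] if names else None


if __name__ == "__main__":
    # Hand-written sample in the assumed layout, not real tool output.
    sample = (
        "<TestSequence>"
        '<Test Name="Tests.ReadSimpleObject" Source="Tests.dll" />'
        '<Test Name="Tests.DeeplyNestedObject" Source="Tests.dll" />'
        "</TestSequence>"
    )
    print(last_test_before_crash(sample))
```

The last entry is only the most likely culprit; if tests run in parallel, the earlier entries would need to be checked as well.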

@steveharter
Member Author

  1. The full exception information -- i.e. the exception.StackTrace (does this not work with StackOverflowException?).

This PR #32167 should add StackTrace to StackOverflowException.

@ViktorHofer ViktorHofer added this to the 5.0.0 milestone Jul 12, 2020
@ViktorHofer ViktorHofer added area-Infrastructure-libraries and removed untriaged New issue has not been triaged by the area owner area-Infrastructure labels Jul 12, 2020
@ghost

ghost commented Jul 12, 2020

Tagging subscribers to this area: @safern, @ViktorHofer
Notify danmosemsft if you want to be subscribed.

@jkotas
Member

jkotas commented Apr 15, 2024

The description lists 3 issues: no symbols, no instructions on how to get the symbols, and lldb instructions\SOS plugin not working.

All 3 of these issues are addressed by the .md file that is generated next to the crash dump, with detailed instructions for how to download the crash dump, matching symbols and SOS. (The template is at https://github.com/dotnet/runtime/blob/main/eng/testing/debug-dump-template.md.)

@jkotas jkotas closed this as completed Apr 15, 2024
@github-actions github-actions bot locked and limited conversation to collaborators May 16, 2024