Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Merge-on-Red] - Generate native/manage stack traces for runtime test crashes #77918

Closed
3 tasks
kunalspathak opened this issue Nov 4, 2022 · 8 comments
Closed
3 tasks
Assignees
Milestone

Comments

@kunalspathak
Copy link
Member

kunalspathak commented Nov 4, 2022

Native crashes of tests should produce the native/manage stack trace that can be shown in Build Analysis page for easier discoverability as well as linking to "known issues".

Goals

When runtime tests fail because of crash or assert, sometimes the developer has to open the console.log of failed test cases to find out the cause. With Merge-on-Red work, we want to provide better experience to the developer where they can see all the information in "Build Analysis" tab which includes the failure details as well as link to any known issues that is already filed in Github. To enable that, we need to have a way to produce textual report of managed/native stack trace that helix can send it back to AzDo.

Work Items

  • Use -crashreport feature of createDump.exe for Linux/mac to generate the stack trace report in json format. WIP: Produce crashreport.json and use llvm-symbolizer to create stack trace #77578
    • We discovered that -crashreport, which uses dladdr does not resolve the native symbol names for non-exported methods. As such, we might have to try different approach.
  • See if we can copy dotnet-sos command to the helix and update CoreclrTestWrapperLib.cs to the lldb command on the generated crash dump, whenever crash happens. The command would be something like lldb -c test.dmp ~/path/to/corerun -o "setclrpath ~/path/to/CORE_ROOT/" -o "bt all" -o "clrstack -all -f" -o "quit".
  • We might need to copy cdb in correlation payload or update helix/containers to include it. But once it is there, we could use !analyze to produce the native and managed (with the help of sos) stacktraces. We decided to install cdb (which comes with sos on windows helix machines). Install public version of windbg/cdb on windows helix machines arcade#11868
@dotnet-issue-labeler
Copy link

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@ghost ghost added the untriaged New issue has not been triaged by the area owner label Nov 4, 2022
@kunalspathak
Copy link
Member Author

@JulieLeeMSFT @hoyosjs

@ghost
Copy link

ghost commented Nov 4, 2022

Tagging subscribers to this area: @dotnet/runtime-infrastructure
See info in area-owners.md if you want to be subscribed.

Issue Details

Native crashes of tests should produce the native/manage stack trace that can be shown in Build Analysis page for easier discoverability as well as linking to "known issues".

Goals

When runtime tests fail because of crash or assert, sometimes the developer has to open the console.log of failed test cases to find out the cause. With Merge-on-Red work, we want to provide better experience to the developer where they can see all the information in "Build Analysis" tab which includes the failure details as well as link to any known issues that is already filed in Github. To enable that, we need to have a way to produce textual report of managed/native stack trace that helix can send it back to AzDo.

Work Items

  • Use -crashreport feature of createDump.exe for Linux/mac to generate the stack trace report in json format. WIP: Produce crashreport.json and use llvm-symbolizer to create stack trace #77578
    • We discovered that -crashreport, which uses dladdr does not resolve the native symbol names for non-exported methods. As such, we might have to try different approach.
  • See if we can copy dotnet-sos command to the helix and update CoreclrTestWrapperLib.cs to the lldb command on the generated crash dump, whenever crash happens. The command would be something like lldb -c test.dmp ~/path/to/corerun -o "setclrpath ~/path/to/CORE_ROOT/" -o "bt all" -o "clrstack -all -f" -o "quit".
  • We might need to copy cdb in correlation payload or update helix/containers to include it. But once it is there, we could use !analyze to produce the native and managed (with the help of sos) stacktraces.
Author: kunalspathak
Assignees: -
Labels:

area-Infrastructure, untriaged

Milestone: -

@jkoritzinsky
Copy link
Member

Another option would be to integrate with llvm-symbolizer. The various native sanitizer runtimes all integrate with this tool to convert addresses to symbols without needing to open up a dump file. It might be faster to integrate with this instead of lldb+dotnet-sos

@kunalspathak
Copy link
Member Author

this tool to convert addresses to symbols without needing to open up a dump file.

With that the flow would roughly be:

  1. Create the dump along with crashreport.json
  2. Extract the interesting addresses from the json file
  3. Use lldb-symbolizer to convert to symbols
  4. Replace the extracted addresses with symbol names in json file.

@JulieLeeMSFT JulieLeeMSFT added this to the 8.0.0 milestone Nov 7, 2022
@ghost ghost removed the untriaged New issue has not been triaged by the area owner label Nov 7, 2022
@kunalspathak
Copy link
Member Author

kunalspathak commented Nov 11, 2022

With the help of @jkoritzinsky , I was able to get llvm-symbolizer get the symbol name. After that I wrote a logic to read various addresses from json file (see coredump.3654.crashreport.json.txt) and constructed the stack trace which looks like this:

Invoking llvm-symbolizer --obj=/home/kpathak/helix/77578/correlation-payload/CORE_ROOT/libcoreclr.so -p
----------------------------------
Thread Id: 0xe46
      Child SP               IP Call Site
 0x7fffe62428e0 0x7effc6e2b3cc libpthread.so.0!read
 0x7fffe6242910 0x7effc436d425 libSystem.Native.so!SystemNative_ReadStdin
 0x7fffe6242950 0x7eff4828340a 0x7eff4828340a
 0x7fffe6242a20 0x7eff48283276 System.Console.dll!Interop+Sys.ReadStdin(Byte*, Int32)
 0x7fffe6242a50 0x7eff482831c3 System.Console.dll!System.IO.StdInReader.ReadStdin(Byte*, Int32)
 0x7fffe6242aa0 0x7eff48282e3f System.Console.dll!System.IO.StdInReader.ReadKey()
 0x7fffe6242f80 0x7eff4828218f System.Console.dll!System.IO.StdInReader.ReadLineCore(Boolean)
 0x7fffe6243170 0x7eff48281eb5 System.Console.dll!System.IO.StdInReader.ReadLine()
 0x7fffe62431d0 0x7eff48281dff System.Console.dll!System.IO.SyncTextReader.ReadLine()
 0x7fffe6243220 0x7eff48281087 System.Console.dll!System.Console.ReadLine()
 0x7fffe6243260 0x7eff48272ca9 ReadLine.dll!Program.<Main>$(System.String[])
----------------------------------
Thread Id: 0xe47
      Child SP               IP Call Site
 0x7effc5d30458 0x7effc69f273d libc.so.6!syscall
 0x7effc5d30460 0x7effc5db115d liblttng-ust.so.0!
 0x7effc5d30f00 0x7effc6e20609 libpthread.so.0!
 0x7effc5d30fc0 0x7effc69f9133 libc.so.6!clone
----------------------------------
Thread Id: 0xe48
      Child SP               IP Call Site
 0x7effc552f458 0x7effc69f273d libc.so.6!syscall
 0x7effc552f460 0x7effc5db115d liblttng-ust.so.0!
 0x7effc552ff00 0x7effc6e20609 libpthread.so.0!
 0x7effc552ffc0 0x7effc69f9133 libc.so.6!clone
----------------------------------
Thread Id: 0xe49
      Child SP               IP Call Site
 0x7effc4d2ecd0 0x7effc69ec99f libc.so.6!__poll
 0x7effc4d2ed00 0x7effc67d27c2 CorUnix::CPalSynchronizationManager::ReadBytesFromProcessPipe(int, unsigned char*, int) at /__w/1/s/src/coreclr/pal/src/synchmgr/synchmanager.cpp:2237:24
 0x7effc4d2ed60 0x7effc67d1d75 CorUnix::CPalSynchronizationManager::ReadCmdFromProcessPipe(int, CorUnix::CPalSynchronizationManager::SynchWorkerCmd*, void**, unsigned int*) at /__w/1/s/src/coreclr/pal/src/synchmgr/synchmanager.cpp:2015:16
 0x7effc4d2edb0 0x7effc67d0eb8 CorUnix::CPalSynchronizationManager::WorkerThread(void*) at /__w/1/s/src/coreclr/pal/src/synchmgr/synchmanager.cpp:1722:26
 0x7effc4d2ee40 0x7effc67e4672 CorUnix::CPalThread::ThreadEntry(void*) at /__w/1/s/src/coreclr/pal/src/thread/thread.cpp:1760:16
 0x7effc4d2ef00 0x7effc6e20609 libpthread.so.0!
 0x7effc4d2efc0 0x7effc69f9133 libc.so.6!clone
----------------------------------
Thread Id: 0xe4a
      Child SP               IP Call Site
 0x7effb7ffea30 0x7effc69ec99f libc.so.6!__poll
 0x7effb7ffea60 0x7effc663ea2c ipc_retry_syscall(int) at /__w/1/s/src/native/eventpipe/ds-ipc-pal-socket.c:283:16
 (inlined by) ipc_poll_fds(pollfd*, unsigned long, unsigned int) at /__w/1/s/src/native/eventpipe/ds-ipc-pal-socket.c:486:11
 (inlined by) ds_ipc_poll(_DiagnosticsIpcPollHandle*, unsigned long, unsigned int, void (*)(char const*, unsigned int)) at /__w/1/s/src/native/eventpipe/ds-ipc-pal-socket.c:1096:16
 0x7effb7ffeaa0 0x7effc6745836 ds_ipc_stream_factory_get_next_available_stream(void (*)(char const*, unsigned int)) at /__w/1/s/src/native/eventpipe/ds-ipc.c:406:15
 0x7effb7ffedd0 0x7effc6742fb6 server_thread(void*) at /__w/1/s/src/native/eventpipe/ds-server.c:130:8
 0x7effb7ffee40 0x7effc67e4672 CorUnix::CPalThread::ThreadEntry(void*) at /__w/1/s/src/coreclr/pal/src/thread/thread.cpp:1760:16
 0x7effb7ffef00 0x7effc6e20609 libpthread.so.0!
 0x7effb7ffefc0 0x7effc69f9133 libc.so.6!clone
----------------------------------
Thread Id: 0xe4b
      Child SP               IP Call Site
 0x7effb77fd660 0x7effc6e2bb34 libpthread.so.0!__open64
 0x7effb77fd6e0 0x7effc663f8bc TwoWayPipe::WaitForConnection() at /__w/1/s/src/coreclr/debug/debug-pal/unix/twowaypipe.cpp:87:19
 0x7effb77fd710 0x7effc6636f7b DbgTransportSession::TransportWorker() at /__w/1/s/src/coreclr/debug/ee/../shared/dbgtransportsession.cpp:1315:17
 0x7effb77fde20 0x7effc6635b45 DbgTransportSession::TransportWorkerStatic(void*) at /__w/1/s/src/coreclr/debug/ee/../shared/dbgtransportsession.cpp:1235:5
 0x7effb77fde40 0x7effc67e4672 CorUnix::CPalThread::ThreadEntry(void*) at /__w/1/s/src/coreclr/pal/src/thread/thread.cpp:1760:16
 0x7effb77fdf00 0x7effc6e20609 libpthread.so.0!
 0x7effb77fdfc0 0x7effc69f9133 libc.so.6!clone
----------------------------------
Thread Id: 0xe4c
      Child SP               IP Call Site
 0x7effb6ffc990 0x7effc6e27376 libpthread.so.0!pthread_cond_wait
 0x7effb6ffca60 0x7effc67ce747 CorUnix::CPalSynchronizationManager::ThreadNativeWait(CorUnix::_ThreadNativeWaitData*, unsigned int, CorUnix::ThreadWakeupReason*, unsigned int*) at /__w/1/s/src/coreclr/pal/src/synchmgr/synchmanager.cpp:0:28
 0x7effb6ffcad0 0x7effc67cdd8b CorUnix::CPalSynchronizationManager::BlockThread(CorUnix::CPalThread*, unsigned int, bool, bool, CorUnix::ThreadWakeupReason*, unsigned int*) at /__w/1/s/src/coreclr/pal/src/synchmgr/synchmanager.cpp:301:22
 0x7effb6ffcb30 0x7effc67d748f CorUnix::InternalWaitForMultipleObjectsEx(CorUnix::CPalThread*, unsigned int, void* const*, int, unsigned int, int, int) at /__w/1/s/src/coreclr/pal/src/synchmgr/wait.cpp:637:45
 0x7effb6ffccf0 0x7effc67d7e1f WaitForMultipleObjectsEx at /__w/1/s/src/coreclr/pal/src/synchmgr/wait.cpp:201:13
 0x7effb6ffcd40 0x7effc66328f6 DebuggerRCThread::MainLoop() at /__w/1/s/src/coreclr/debug/ee/rcthread.cpp:927:14
 0x7effb6ffcde0 0x7effc6632719 DebuggerRCThread::ThreadProc() at /__w/1/s/src/coreclr/debug/ee/rcthread.cpp:730:5
 0x7effb6ffce20 0x7effc6632170 DebuggerRCThread::ThreadProcStatic(void*) at /__w/1/s/src/coreclr/debug/ee/rcthread.cpp:1321:5
 0x7effb6ffce40 0x7effc67e4672 CorUnix::CPalThread::ThreadEntry(void*) at /__w/1/s/src/coreclr/pal/src/thread/thread.cpp:1760:16
 0x7effb6ffcf00 0x7effc6e20609 libpthread.so.0!
 0x7effb6ffcfc0 0x7effc69f9133 libc.so.6!clone
----------------------------------
Thread Id: 0xe4d
      Child SP               IP Call Site
 0x7effb67fb600 0x7effc6e277d1 libpthread.so.0!pthread_cond_timedwait
 0x7effb67fb6e0 0x7effc67ce728 CorUnix::CPalSynchronizationManager::ThreadNativeWait(CorUnix::_ThreadNativeWaitData*, unsigned int, CorUnix::ThreadWakeupReason*, unsigned int*) at /__w/1/s/src/coreclr/pal/src/synchmgr/synchmanager.cpp:0:0
 0x7effb67fb750 0x7effc67cdd8b CorUnix::CPalSynchronizationManager::BlockThread(CorUnix::CPalThread*, unsigned int, bool, bool, CorUnix::ThreadWakeupReason*, unsigned int*) at /__w/1/s/src/coreclr/pal/src/synchmgr/synchmanager.cpp:301:22
 0x7effb67fb7b0 0x7effc67d748f CorUnix::InternalWaitForMultipleObjectsEx(CorUnix::CPalThread*, unsigned int, void* const*, int, unsigned int, int, int) at /__w/1/s/src/coreclr/pal/src/synchmgr/wait.cpp:637:45
 0x7effb67fb970 0x7effc67d7e1f WaitForMultipleObjectsEx at /__w/1/s/src/coreclr/pal/src/synchmgr/wait.cpp:201:13
 0x7effb67fb9c0 0x7effc639a451 FinalizerThread::WaitForFinalizerEvent(CLREvent*) at /__w/1/s/src/coreclr/vm/finalizerthread.cpp:173:9
 0x7effb67fba00 0x7effc639a702 FinalizerThread::FinalizerThreadWorker(void*) at /__w/1/s/src/coreclr/vm/finalizerthread.cpp:262:13
 0x7effb67fbc60 0x7effc62de99e ManagedThreadBase_DispatchMiddle(ManagedThreadCallState*) at /__w/1/s/src/coreclr/vm/threads.cpp:0:5
 (inlined by) ManagedThreadBase_DispatchOuter(ManagedThreadCallState*)::$_7::operator()(ManagedThreadBase_DispatchOuter(ManagedThreadCallState*)::TryArgs*) const::'lambda'(Param*)::operator()(Param*) const at /__w/1/s/src/coreclr/vm/threads.cpp:7450:13
 (inlined by) ManagedThreadBase_DispatchOuter(ManagedThreadCallState*)::$_7::operator()(ManagedThreadBase_DispatchOuter(ManagedThreadCallState*)::TryArgs*) const at /__w/1/s/src/coreclr/vm/threads.cpp:7452:9
 (inlined by) ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) at /__w/1/s/src/coreclr/vm/threads.cpp:7476:5
 0x7effb67fbdc0 0x7effc62df1bd ManagedThreadBase_NoADTransition(void (*)(void*), UnhandledExceptionLocation) at /__w/1/s/src/coreclr/vm/threads.cpp:7520:5
 (inlined by) ManagedThreadBase::FinalizerBase(void (*)(void*)) at /__w/1/s/src/coreclr/vm/threads.cpp:7539:5
 0x7effb67fbdf0 0x7effc639ad48 FinalizerThread::FinalizerThreadStart(void*) at /__w/1/s/src/coreclr/vm/finalizerthread.cpp:394:22
 0x7effb67fbe40 0x7effc67e4672 CorUnix::CPalThread::ThreadEntry(void*) at /__w/1/s/src/coreclr/pal/src/thread/thread.cpp:1760:16
 0x7effb67fbf00 0x7effc6e20609 libpthread.so.0!
 0x7effb67fbfc0 0x7effc69f9133 libc.so.6!clone
----------------------------------
Thread Id: 0xe4f
      Child SP               IP Call Site
 0x7effc4146e60 0x7effc6e2b3cc libpthread.so.0!read
 0x7effc4146e90 0x7effc436a725 libSystem.Native.so!
 0x7effc4146f00 0x7effc6e20609 libpthread.so.0!
 0x7effc4146fc0 0x7effc69f9133 libc.so.6!clone

@kunalspathak
Copy link
Member Author

@jkoritzinsky or @hoyosjs - can one of you confirm the place from where I am invoking TryPrintStackTraceFromCrashReport(). I am calling it from CollectCrashDump() (when there is a hang) or if the RunTest() executable returned error code, in which case, I will check if coredump.<PID>.crashreport.json was created in HELIX_UPLOAD_FOLDER and if yes, call TryPrintStackTraceFromCrashReport() to print the stack trace.

@kunalspathak
Copy link
Member Author

Fixed with #77578

@ghost ghost locked as resolved and limited conversation to collaborators Apr 19, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

3 participants