Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Linux] GCStress=C assert: "(m_pStartFrame != NULL) || (m_flags & POPFRAMES) || (m_pRealStartFrame == m_pThread->GetFrame()) || (bRedirectedPinvoke)" #11678

Closed
BruceForstall opened this issue Dec 18, 2018 · 9 comments · Fixed by dotnet/coreclr#25359
Assignees
Labels
area-VM-coreclr GCStress os-linux Linux OS (any supported distro)
Milestone

Comments

@BruceForstall
Copy link
Member

Seen in several arm32 GCStress=C runs, but not every run, and not the same test. Some examples:

https://ci.dot.net/job/dotnet_coreclr/job/master/view/arm/job/jitstress/job/arm_cross_checked_ubuntu_gcstress0xc_flow/44/

14:32:49       Loader_classloader._TypeGeneratorTests_TypeGeneratorTest1038_Generated1038_Generated1038_._TypeGeneratorTests_TypeGeneratorTest1038_Generated1038_Generated1038_sh [FAIL]
14:32:49         
14:32:49 
        Assert failure(PID 443 [0x000001bb], Thread: 504 [0x01f8]): (m_pStartFrame != NULL) || (m_flags & POPFRAMES) || (m_pRealStartFrame == m_pThread->GetFrame()) || (bRedirectedPinvoke)
14:32:49             File: /mnt/j/workspace/dotnet_coreclr/master/arm_cross_checked_ubuntu/src/vm/stackwalk.cpp Line: 1577
14:32:49             Image: /ssd/j/workspace/dotnet_coreclr/master/jitstress/arm_cross_checked_ubuntu_gcstress0xc_tst/bin/tests/Linux.arm.Checked/Tests/Core_Root/corerun
14:32:49         
14:32:49         /ssd/j/workspace/dotnet_coreclr/master/jitstress/arm_cross_checked_ubuntu_gcstress0xc_tst/bin/tests/Linux.arm.Checked/Loader/classloader/TypeGeneratorTests/TypeGeneratorTest1038/Generated1038/Generated1038.sh: line 244:   443 Aborted                 $_DebuggerFullPath "$CORE_ROOT/corerun" $ExePath $CLRTestExecutionArguments

https://ci.dot.net/job/dotnet_coreclr/job/master/view/arm/job/jitstress/job/arm_cross_checked_ubuntu_gcstress0xc_flow/42/

11:36:45 
      GC_Coverage._LargeObjectAlloc2_LargeObjectAlloc2_._LargeObjectAlloc2_LargeObjectAlloc2_sh [FAIL]
11:36:45         
11:36:45 
        Assert failure(PID 28988 [0x0000713c], Thread: 29011 [0x7153]): (m_pStartFrame != NULL) || (m_flags & POPFRAMES) || (m_pRealStartFrame == m_pThread->GetFrame()) || (bRedirectedPinvoke)
11:36:45             File: /mnt/j/workspace/dotnet_coreclr/master/arm_cross_checked_ubuntu/src/vm/stackwalk.cpp Line: 1577
11:36:45             Image: /ssd/j/workspace/dotnet_coreclr/master/jitstress/arm_cross_checked_ubuntu_gcstress0xc_tst/bin/tests/Linux.arm.Checked/Tests/Core_Root/corerun
11:36:45         
11:36:45         /ssd/j/workspace/dotnet_coreclr/master/jitstress/arm_cross_checked_ubuntu_gcstress0xc_tst/bin/tests/Linux.arm.Checked/GC/Coverage/LargeObjectAlloc2/LargeObjectAlloc2.sh: line 244: 28988 Aborted                 $_DebuggerFullPath "$CORE_ROOT/corerun" $ExePath $CLRTestExecutionArguments
@BruceForstall
Copy link
Member Author

@janvorli

@RussKeldorph
Copy link
Contributor

Any chance this is in some way related to #11097?

@VincentBu
Copy link
Contributor

https://mc.dot.net/#/user/coreclr-outerloop-gcstress-extra/ci2Fdotnet2Fcoreclr2Frefs2Fheads2Fmaster/test2Ffunctional2Fcli2F/20190615.1/workItem/baseservices.threading/analysis/xunit/baseservices_threading._threadstatic_threadstatic07_threadstatic07_~2F_threadstatic_threadstatic07_threadstatic07_sh
Appeared in following configuration:
Debian.9.Amd64.Open-x64-Checked-gcstress0xc_jitstress1
Ubuntu.1804.Arm32.Open-arm-Checked-gcstress0x3
Ubuntu.1804.Arm32.Open-arm-Checked-gcstress0xc
Ubuntu.1804.Arm32.Open-arm-Checked-gcstress0xc_zapdisable_jitstress2

@sergiy-k
Copy link
Contributor

@janvorli, could you please take a look? Please note that this issue is not arm specific.

@janvorli
Copy link
Member

Please note that this issue is not arm specific.

Right, Kusto shows that it happens on Windows 10 in about 50% of the cases (there are 34 occurences total). However the last occurence on Windows was at the end of March, so maybe that one was a different underlying problem.
Anyways, on Linux, it keeps happening and it actually happens more often on AMD64 than on ARM32.

@janvorli
Copy link
Member

After a whole weekend and a day running the test in a loop, I was able to reproduce it on my Ubuntu x64 vm and got a core dump. I am looking into it.

@janvorli janvorli changed the title [Linux/arm32] GCStress=C assert: "(m_pStartFrame != NULL) || (m_flags & POPFRAMES) || (m_pRealStartFrame == m_pThread->GetFrame()) || (bRedirectedPinvoke)" [Linux] GCStress=C assert: "(m_pStartFrame != NULL) || (m_flags & POPFRAMES) || (m_pRealStartFrame == m_pThread->GetFrame()) || (bRedirectedPinvoke)" Jun 24, 2019
@janvorli
Copy link
Member

I believe I understand the issue. Below there is a call stack of the thread being walked and StackFrameIterator state that shows what's going on. What happened is that when the stack walking thread initiated the stack frame iterator, the thread being walked had the m_pThread->m_pFrame (the topmost explicit frame) set to the InlinedCallFrame (0x00007f764bffdba0). The stack walking thread saved that to m_pRealStartFrame. The thread being walked was not suspended, as it was running in native code somewhere between the InlinedCallFrame and ResumableFrame in the call stack below. We don't need to suspend such threads as we only care about the managed frames staying untouched and we have traps at unmanaged to managed transition (both when returning from the PInvoke that uses the InlinedCallFrame and when calling any managed callback).
But this case is special, as we were handling invalid instruction that was intentionally injected by GC stress. And that resulted in creating the explicit ResumableFrame that got pushed to the thread's stack of explicit frames. The stack walking thread was running in parallel with that and did a sanity check using StackFrameIterator::IsValid. But that check failed, since the m_pRealStartFrame didn't match the topmost frame in the explicit frames stack of the thread being walked. And so the assert has fired.

StackFrameIterator state at the time the assert fired:

(lldb) p m_pStartFrame
(PTR_Frame) $0 = 0x0000000000000000
(lldb) p/x m_flags
(ULONG32) $2 = 0x00008500
(lldb) p m_pRealStartFrame
(InlinedCallFrame *) $3 = 0x00007f764bffdba0
(lldb) p m_pThread->m_pFrame
(ResumableFrame *) $4 = 0x00007f764bffc8d0
(lldb) p bRedirectedPinvoke
(BOOL) $5 = NO

Call stack of the thread being walked:

OS Thread Id: 0x39e6 (13)
        Child SP               IP Call Site
00007F764BFFB910 00007F7713C45360 libpthread.so.0!__pthread_cond_wait + 192
00007F764BFFB940 00007F77128D0095 libcoreclr.so!CorUnix::CPalSynchronizationManager::ThreadNativeWait(CorUnix::_ThreadNativeWaitData*, unsigned int, CorUnix::ThreadWakeupReason*, unsigned int*) + 645 at /home/janvorli/git/coreclr2/src/pal/src/synchmgr/synchmanager.cpp:479
00007F764BFFBAA0 00007F77128CF361 libcoreclr.so!CorUnix::CPalSynchronizationManager::BlockThread(CorUnix::CPalThread*, unsigned int, bool, bool, CorUnix::ThreadWakeupReason*, unsigned int*) + 1569 at /home/janvorli/git/coreclr2/src/pal/src/synchmgr/synchmanager.cpp:302
00007F764BFFBC90 00007F77128E149F libcoreclr.so!CorUnix::InternalWaitForMultipleObjectsEx(CorUnix::CPalThread*, unsigned int, void* const*, int, unsigned int, int, int) + 5775 at /home/janvorli/git/coreclr2/src/pal/src/synchmgr/wait.cpp:636
00007F764BFFC0D0 00007F77128E2051 libcoreclr.so!WaitForSingleObjectEx + 273 at /home/janvorli/git/coreclr2/src/pal/src/synchmgr/wait.cpp:139
00007F764BFFC170 00007F7712341F50 libcoreclr.so!CLREventWaitHelper2(void*, unsigned int, int) + 48 at /home/janvorli/git/coreclr2/src/vm/synch.cpp:377
00007F764BFFC190 00007F7712341DFA libcoreclr.so!CLREventWaitHelper(void*, unsigned int, int)::$_1::operator()(CLREventWaitHelper(void*, unsigned int, int)::Param*) const + 42 at /home/janvorli/git/coreclr2/src/vm/synch.cpp:402
00007F764BFFC1B0 00007F771234068D libcoreclr.so!CLREventWaitHelper(void*, unsigned int, int) + 141 at /home/janvorli/git/coreclr2/src/vm/synch.cpp:404
00007F764BFFC270 00007F77123405DE libcoreclr.so!CLREventBase::WaitEx(unsigned int, WaitMode, PendingSync*) + 430 at /home/janvorli/git/coreclr2/src/vm/synch.cpp:471
00007F764BFFC2D0 00007F771234041D libcoreclr.so!CLREventBase::Wait(unsigned int, int, PendingSync*) + 77 at /home/janvorli/git/coreclr2/src/vm/synch.cpp:417
00007F764BFFC300 00007F77124E9A65 libcoreclr.so!GCEvent::Impl::Wait(unsigned int, bool) + 133 at /home/janvorli/git/coreclr2/src/vm/gcenv.os.cpp:1153
00007F764BFFC330 00007F77124E8E25 libcoreclr.so!GCEvent::Wait(unsigned int, bool) + 117 at /home/janvorli/git/coreclr2/src/vm/gcenv.os.cpp:1231
00007F764BFFC360 00007F77125A0C0A libcoreclr.so!WKS::gc_heap::wait_for_gc_done(int) + 74 at /home/janvorli/git/coreclr2/src/gc/gc.cpp:10322
00007F764BFFC390 00007F77125A0B4D libcoreclr.so!WKS::WaitLongerNoInstru(int) + 189 at /home/janvorli/git/coreclr2/src/gc/gc.cpp:1656
00007F764BFFC3C0 00007F77125B9022 libcoreclr.so!WKS::GCHeap::GarbageCollectGeneration(unsigned int, gc_reason) + 466 at /home/janvorli/git/coreclr2/src/gc/gc.cpp:1689
00007F764BFFC3C0 00007F77125B8EF7 libcoreclr.so!WKS::GCHeap::GarbageCollectGeneration(unsigned int, gc_reason) + 167 at /home/janvorli/git/coreclr2/src/gc/gc.cpp:1686
00007F764BFFC3C0 00007F77125B8EE3 libcoreclr.so!WKS::GCHeap::GarbageCollectGeneration(unsigned int, gc_reason) + 147 at /home/janvorli/git/coreclr2/src/gc/gc.cpp:1742
00007F764BFFC510 00007F7712604DAB libcoreclr.so!WKS::GCHeap::GarbageCollectTry(int, int, int) + 299 at /home/janvorli/git/coreclr2/src/gc/gc.cpp:35420
00007F764BFFC550 00007F7712605F56 libcoreclr.so!WKS::GCHeap::GarbageCollect(int, bool, int) + 902 at /home/janvorli/git/coreclr2/src/gc/gc.cpp:35354
00007F764BFFC620 00007F7712604C17 libcoreclr.so!WKS::GCHeap::StressHeap(gc_alloc_context*) + 2743 at /home/janvorli/git/coreclr2/src/gc/gc.cpp:34860
00007F764BFFC7C0 00007F771222994C libcoreclr.so!DoGcStress(_CONTEXT*, MethodDesc*) + 1644 at /home/janvorli/git/coreclr2/src/vm/gccover.cpp:1830
00007F764BFFC8D0                  [ResumableFrame: 00007f764bffc8d0]
00007F764BFFC9A0 00007F771222927E libcoreclr.so!OnGcCoverageInterrupt(_CONTEXT*) + 526 at /home/janvorli/git/coreclr2/src/vm/gccover.cpp:1453
00007F764BFFCA60 00007F77120126D2 libcoreclr.so!IsGcMarker(_CONTEXT*, _EXCEPTION_RECORD*) + 82 at /home/janvorli/git/coreclr2/src/vm/excep.cpp:6599
00007F764BFFCAC0 00007F77123893A1 libcoreclr.so!HandleHardwareException(PAL_SEHException*) + 257 at /home/janvorli/git/coreclr2/src/vm/exceptionhandling.cpp:5225
00007F764BFFD0F0 00007F771281B575 libcoreclr.so!SEHProcessException(PAL_SEHException*) + 581 at /home/janvorli/git/coreclr2/src/pal/src/exception/seh.cpp:286
00007F764BFFD190 00007F771281E2CE libcoreclr.so!common_signal_handler(int, siginfo_t*, void*, int, ...) + 1198 at /home/janvorli/git/coreclr2/src/pal/src/exception/signal.cpp:837
00007F764BFFD9E0 00007F771281DC68 libcoreclr.so!signal_handler_worker + 504 at /home/janvorli/git/coreclr2/src/pal/src/exception/signal.cpp:378
00007F764BFFDB00 00007F7712902F16 libcoreclr.so!CallSignalHandlerWrapper0 + 6 at /home/janvorli/git/coreclr2/src/pal/src/arch/amd64/callsignalhandlerwrapper.S:30
00007F764BFFDB90 00007F7698E0B162
00007F764BFFDBA0                  [InlinedCallFrame: 00007f764bffdba0] System.Console.dll!Interop+Sys.Write(System.Runtime.InteropServices.SafeHandle, Byte*, Int32)
00007F764BFFDB90 00007F7698E0B162 System.Console.dll!ILStubClass.IL_STUB_PInvoke + 178
00007F764BFFDC30 00007F7698E0AA09 System.Console.dll!System.ConsolePal.Write(Microsoft.Win32.SafeHandles.SafeFileHandle, Byte*, Int32, Boolean) + 153 [/_/src/System.Console/src/System/ConsolePal.Unix.cs @ 1249]
00007F764BFFDCF0 00007F7698E0A93F System.Console.dll!System.ConsolePal.Write(Microsoft.Win32.SafeHandles.SafeFileHandle, Byte[], Int32, Int32, Boolean) + 127
00007F764BFFDD40 00007F7698E0A728 System.Console.dll!System.ConsolePal+UnixConsoleStream.Write(Byte[], Int32, Int32) + 72 [/_/src/System.Console/src/System/ConsolePal.Unix.cs @ 1454]
00007F764BFFDD70 00007F7698984C9A System.Private.CoreLib.dll!System.IO.StreamWriter.Flush(Boolean, Boolean) + 218
00007F764BFFDDC0 00007F769898612F System.Private.CoreLib.dll!System.IO.StreamWriter.WriteFormatHelper(System.String, System.ParamsArray, Boolean) + 1215
00007F764BFFDEC0 00007F76989867BB System.Private.CoreLib.dll!System.IO.StreamWriter.WriteLine(System.String, System.Object[]) + 171
00007F764BFFDF20 00007F769899000C System.Private.CoreLib.dll!System.IO.TextWriter+SyncTextWriter.WriteLine(System.String, System.Object[]) + 76
00007F764BFFDF60 00007F7698E13DB0 System.Console.dll!System.Console.WriteLine(System.String, System.Object[]) + 112 [/_/src/System.Console/src/System/Console.cs @ 577]
00007F764BFFDF90 00007F7698E17657 threadstatic07.exe!Sensor.ThreadFunc() + 1879 [/home/janvorli/git/coreclr2/tests/src/baseservices/threading/threadstatic/threadstatic07.cs @ 136]
00007F764BFFE130 00007F769888450D System.Private.CoreLib.dll!System.Threading.ThreadHelper.ThreadStart_Context(System.Object) + 205
00007F764BFFE160 00007F769888F7A1 System.Private.CoreLib.dll!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) + 129
00007F764BFFE1B0 00007F76988846B2 System.Private.CoreLib.dll!System.Threading.ThreadHelper.ThreadStart() + 114
00007F764BFFE1E0 00007F77123ADC03 libcoreclr.so!CallDescrWorkerInternal + 124 at /home/janvorli/git/coreclr2/src/pal/inc/unixasmmacrosamd64.inc:870
00007F764BFFE200 00007F77121A0CF0 libcoreclr.so!CallDescrWorkerWithHandler(CallDescrData*, int) + 400 at /home/janvorli/git/coreclr2/src/vm/callhelpers.cpp:70
00007F764BFFE270 00007F77121A2231 libcoreclr.so!MethodDescCallSite::CallTargetWorker(unsigned long const*, unsigned long*, int) + 2401 at /home/janvorli/git/coreclr2/src/vm/callhelpers.cpp:607
00007F764BFFE580 00007F7711FE90D3 libcoreclr.so!MethodDescCallSite::Call(unsigned long const*) + 35 at /home/janvorli/git/coreclr2/src/vm/callhelpers.h:468
00007F764BFFE5A0 00007F77121B511F libcoreclr.so!ThreadNative::KickOffThread_Worker(void*) + 1055 at /home/janvorli/git/coreclr2/src/vm/comsynchronizable.cpp:246
00007F764BFFE7E8                  [GCFrame: 00007f764bffe7e8]
00007F764BFFE850 00007F771213DDF8 libcoreclr.so!ManagedThreadBase_DispatchInner(ManagedThreadCallState*) + 40 at /home/janvorli/git/coreclr2/src/vm/threads.cpp:7411
00007F764BFFE870 00007F771213D6E5 libcoreclr.so!ManagedThreadBase_DispatchMiddle(ManagedThreadCallState*) + 149 at /home/janvorli/git/coreclr2/src/vm/threads.cpp:7454
00007F764BFFEA20 00007F771213D51D libcoreclr.so!ManagedThreadBase_DispatchOuter(ManagedThreadCallState*)::$_6::operator()(ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) const::TryArgs*) const::'lambda'(Param*)::operator()(Param*) const + 29 at /home/janvorli/git/coreclr2/src/vm/threads.cpp:7620
00007F764BFFEA40 00007F771213D34F libcoreclr.so!ManagedThreadBase_DispatchOuter(ManagedThreadCallState*)::$_6::operator()(ManagedThreadBase_DispatchOuter(ManagedThreadCallState*)::TryArgs*) const + 111 at /home/janvorli/git/coreclr2/src/vm/threads.cpp:7620
00007F764BFFEAE0 00007F771213A2C5 libcoreclr.so!ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) + 229 at /home/janvorli/git/coreclr2/src/vm/threads.cpp:7646
00007F764BFFEB78                  [DebuggerU2MCatchHandlerFrame: 00007f764bffeb78]
00007F764BFFEBA0 00007F771213A407 libcoreclr.so!ManagedThreadBase_FullTransition(void (*)(void*), void*, UnhandledExceptionLocation) + 71 at /home/janvorli/git/coreclr2/src/vm/threads.cpp:7666
00007F764BFFEC00 00007F771213A3B2 libcoreclr.so!ManagedThreadBase::KickOff(void (*)(void*), void*) + 34 at /home/janvorli/git/coreclr2/src/vm/threads.cpp:7702
00007F764BFFEC20 00007F77121B5AE0 libcoreclr.so!ThreadNative::KickOffThread(void*) + 1936 at /home/janvorli/git/coreclr2/src/vm/comsynchronizable.cpp:363
00007F764BFFEDF0 00007F7712131439 libcoreclr.so!Thread::intermediateThreadProc(void*) + 169 at /home/janvorli/git/coreclr2/src/vm/threads.cpp:2060
00007F764BFFEE40 00007F77128F9F94 libcoreclr.so!CorUnix::CPalThread::ThreadEntry(void*) + 1588 at /home/janvorli/git/coreclr2/src/pal/src/thread/thread.cpp:1807
00007F764BFFEF20 00007F7713C3F6BA libpthread.so.0!start_thread + 202
00007F764BFFEFC0 00007F77130D441D libc.so.6!__clone + 109 at /build/glibc-Cl5G7W/glibc-2.23/misc/../sysdeps/unix/sysv/linux/x86_64/clone.S:111

@janvorli
Copy link
Member

And one more additional detail. One might wonder how come the bRedirectedPinvoke was false, as it should have been set to to true when the topmost explicit frame of the thread being walked was a ResumableFrame. What I believe happened here was that the thread being walked actually added the ResumableFrame after the bRedirectedPinvoke was computed, but before the assert condition was evaluated. That would explain why it took it three days running the test in a loop to reproduce.
https://github.com/dotnet/coreclr/blob/bad8d915c267057bf74cb819c9e027d68cd1fa35/src/vm/stackwalk.cpp#L1553-L1570

@janvorli
Copy link
Member

I believe the fix should be to not to precompute the bRedirectedPinvoke, but add the whole condition to the end of the _ASSERTE instead.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the 3.0 milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 14, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-VM-coreclr GCStress os-linux Linux OS (any supported distro)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants