Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test failure: JIT\\Regression\\JitBlue\\GitHub_27924\\GitHub_27924\\GitHub_27924.cmd #36681

Closed
v-haren opened this issue May 19, 2020 · 25 comments · Fixed by #41540
Closed

Test failure: JIT\\Regression\\JitBlue\\GitHub_27924\\GitHub_27924\\GitHub_27924.cmd #36681

v-haren opened this issue May 19, 2020 · 25 comments · Fixed by #41540
Assignees
Labels
arch-arm32 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' JitStress CLR JIT issues involving JIT internal stress modes os-windows
Milestone

Comments

@v-haren
Copy link

v-haren commented May 19, 2020

failed in job: runtime-coreclr jitstress 20200517.2

Error message

Assert failure(PID 5488 [0x00001570], Thread: 11524 [0x2d04]): _safePointDecoder.IsSafePoint(curOffs)

<no module>! <no symbol> + 0x0 (0x00000000)
 File: F:\workspace\_work\1\s\src\coreclr\src\vm\eetwain.cpp Line: 5044
 Image: D:\h\w\AAF5093E\p\CoreRun.exe


Return code: 1
Raw output file: D:\h\w\AAF5093E\w\B29D099C\e\JIT\Regression\Reports\JIT.Regression\JitBlue\GitHub_27924\GitHub_27924\GitHub_27924.output.txt
Raw output:
BEGIN EXECUTION
 "D:\h\w\AAF5093E\p\corerun.exe" GitHub_27924.dll 
Expected: 100
Actual: -1073740286
END EXECUTION - FAILED
FAILED
Test Harness Exitcode is : 1
To run the test:
> set CORE_ROOT=D:\h\w\AAF5093E\p
> D:\h\w\AAF5093E\w\B29D099C\e\JIT\Regression\JitBlue\GitHub_27924\GitHub_27924\GitHub_27924.cmd
Expected: True
Actual: False


Stack trace
   at JIT_Regression._JitBlue_GitHub_27924_GitHub_27924_GitHub_27924_._JitBlue_GitHub_27924_GitHub_27924_GitHub_27924_cmd() in F:\workspace\_work\1\s\artifacts\tests\coreclr\Windows_NT.arm.Checked\TestWrappers\JIT.Regression\JIT.Regression.XUnitWrapper.cs:line 122360

category:correctness
theme:gc-stress
skill-level:expert
cost:large

@v-haren v-haren added JitStress CLR JIT issues involving JIT internal stress modes area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI labels May 19, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label May 19, 2020
@BruceForstall BruceForstall added this to the 5.0 milestone May 19, 2020
@BruceForstall BruceForstall removed the untriaged New issue has not been triaged by the area owner label May 19, 2020
@BruceForstall BruceForstall self-assigned this May 19, 2020
@BruceForstall
Copy link
Member

@safern I haven't had a chance to look at this yet. Hopefully this week.

@v-haren
Copy link
Author

v-haren commented Jun 12, 2020

failed again in job: runtime-coreclr jitstress 20200610.1

failed test: JIT\Regression\JitBlue\GitHub_27924\GitHub_27924\GitHub_27924.cmd

Error message

Assert failure(PID 5208 [0x00001458], Thread: 3608 [0x0e18]): _safePointDecoder.IsSafePoint(curOffs)

<no module>! <no symbol> + 0x0 (0x00000000)
 File: F:\workspace\_work\1\s\src\coreclr\src\vm\eetwain.cpp Line: 5044
 Image: D:\h\w\A13108F5\p\CoreRun.exe


Return code: 1
Raw output file: D:\h\w\A13108F5\w\8FFE0869\e\JIT\Regression\Reports\JIT.Regression\JitBlue\GitHub_27924\GitHub_27924\GitHub_27924.output.txt
Raw output:
BEGIN EXECUTION
 "D:\h\w\A13108F5\p\corerun.exe" GitHub_27924.dll 

Expected: 100
Actual: -1073740286
END EXECUTION - FAILED
FAILED
Test Harness Exitcode is : 1
To run the test:
> set CORE_ROOT=D:\h\w\A13108F5\p
> D:\h\w\A13108F5\w\8FFE0869\e\JIT\Regression\JitBlue\GitHub_27924\GitHub_27924\GitHub_27924.cmd
Expected: True
Actual: False


Stack trace
   at JIT_Regression._JitBlue_GitHub_27924_GitHub_27924_GitHub_27924_._JitBlue_GitHub_27924_GitHub_27924_GitHub_27924_cmd() in F:\workspace\_work\1\s\artifacts\tests\coreclr\Windows_NT.arm.Checked\TestWrappers\JIT.Regression\JIT.Regression.XUnitWrapper.cs:line 122360

@BruceForstall
Copy link
Member

Failures are in CoreCLR Windows_NT arm Checked zapdisable @ Windows.10.Arm64v8.Open

@BruceForstall
Copy link
Member

BruceForstall commented Jun 13, 2020

(Note: Windows arm issue)

Even though this configuration sets:

set COMPlus_TieredCompilation=0
set COMPlus_ReadyToRun=0
set COMPlus_ZapDisable=1

the test also sets

COMPlus_GCStress=0xc

(it's the only one in the coreclr test tree to do so, and it's a Pri-0 test, so it probably gets more GCStress runs by far than any other test.)

I couldn't repro it initially, but when I tweaked some of the loops I was able to repro about every 10+ runs, and caught it in the debugger.

The assert is in GCstress code in EECodeManager::EnumGcRefs() that asserts that if the current IP is non-interruptible, then it is a safe point offset after a call, specifically:

#if defined(STRESS_HEAP) && defined(PARTIALLY_INTERRUPTIBLE_GC_SUPPORTED)
    // When we simulate a hijack during gcstress
    //  we start with ActiveStackFrame and the offset
    //  after the call
    // We need to make it look like a non-leaf frame
    //  so that it's treated like a regular hijack
    if (flags & ActiveStackFrame)
    {
        GcInfoDecoder _gcInfoDecoder(
                            gcInfoToken,
                            DECODE_INTERRUPTIBILITY,
                            curOffs
                            );
        if(!_gcInfoDecoder.IsInterruptible())
        {
            // This must be the offset after a call
#ifdef _DEBUG
            GcInfoDecoder _safePointDecoder(gcInfoToken, (GcInfoDecoderFlags)0, 0);
            _ASSERTE(_safePointDecoder.IsSafePoint(curOffs)); /////////////// ********************
#endif
            flags &= ~((unsigned)ActiveStackFrame);
        }
    }
#endif // STRESS_HEAP && PARTIALLY_INTERRUPTIBLE_GC_SUPPORTED

There are many threads. The thread being walked that causes the assert is:

00 0b8ef070 75b6506a ntdll!ZwWaitForSingleObject+0x8
01 0b8ef080 688468de KERNELBASE!WaitForSingleObjectEx+0xab
02 (Inline) -------- CoreCLR!CLREventWaitHelper2(void)+0x7 (Inline Function @ 688468df)
03 (Inline) -------- CoreCLR!CLREventWaitHelper::__l4::__Body::Run(void)+0x7 (Inline Function @ 688468df)
04 0b8ef0e0 6884740a CoreCLR!CLREventWaitHelper(void * handle = 0xfffffe68, unsigned long dwMilliseconds = 0xfffffe68, int alertable = 0n-408)+0x23
05 0b8ef110 68a19e78 CoreCLR!CLREventBase::WaitEx(unsigned long dwMilliseconds = <Value unavailable error>, WaitMode mode = WaitMode_None (0n0), struct PendingSync * syncState = 0x00000000)+0xbb
06 (Inline) -------- CoreCLR!GCEvent::Impl::Wait(void)+0x21 (Inline Function @ 68a19e79)
07 0b8ef148 68a3c4c0 CoreCLR!GCEvent::Wait(unsigned int timeout = 0xffffffff, bool alertable = false)+0x41
08 0b8ef160 68a1fb34 CoreCLR!WKS::gc_heap::wait_for_gc_done(int timeOut = 0n-1)+0x21
09 0b8ef178 68a1d8d0 CoreCLR!WKS::WaitLongerNoInstru(int i = 0n1)+0x51
0a (Inline) -------- CoreCLR!WKS::enter_spin_lock_noinstru(void)+0x107 (Inline Function @ 68a1d8d1)
0b (Inline) -------- CoreCLR!WKS::enter_spin_lock(void)+0x107 (Inline Function @ 68a1d8d1)
0c 0b8ef190 68a1f660 CoreCLR!WKS::GCHeap::GarbageCollectGeneration(unsigned int gen = <Value unavailable error>, gc_reason reason = <Value unavailable error>)+0x129
0d (Inline) -------- CoreCLR!WKS::GCHeap::GarbageCollectTry(void)+0x17 (Inline Function @ 68a1f661)
0e 0b8ef1c8 689351dc CoreCLR!WKS::GCHeap::StressHeap(struct gc_alloc_context * context = <Value unavailable error>)+0x371
0f 0b8ef228 6888ffc6 CoreCLR!DoGcStress(struct _CONTEXT * regs = <Value unavailable error>, class NativeCodeVersion * nativeCodeVersion = <Value unavailable error>)+0x251
10 0b8ef2d8 688115b6 CoreCLR!Thread::RedirectedHandledJITCase(Thread::RedirectReason reason = RedirectReason_GCStress (0n3))+0xfb
11 0b8ef350 0bc8423c CoreCLR!RedirectedHandledJITCaseForGCStress_Stub(void)+0x2b
12 0b8ef360 0bc83536 System_Private_CoreLib!System.Threading.FastRandom.Next(<Win32 error 0n318>)+0x74
13 0b8ef378 0b27f0ca System_Private_CoreLib!System.Threading.ThreadPoolWorkQueue.Dequeue(<Win32 error 0n318>)+0xda
14 0b8ef3a8 0b27efd0 System_Private_CoreLib!System.Threading.ThreadPoolWorkQueue.Dispatch(<Win32 error 0n318>)+0xe6
15 0b8ef400 6881104c System_Private_CoreLib!System.Threading._ThreadPoolWaitCallback.PerformWaitCallback(<Win32 error 0n318>)+0xc
16 0b8ef408 6891e564 CoreCLR!CallDescrWorkerInternal(void)+0x45
17 0b8ef418 6891e9ae CoreCLR!CallDescrWorker(struct CallDescrData * pCallDescrData = 0x0b8ef918)+0x99
18 0b8ef840 689e9154 CoreCLR!MethodDescCallSite::CallTargetWorker(unsigned int64 * pArguments = 0x00000000, unsigned int64 * pReturnValue = <Value unavailable error>, int cbReturnValue = <Value unavailable error>)+0x35b
19 (Inline) -------- CoreCLR!MethodDescCallSite::Call_RetBool(void)+0x77 (Inline Function @ 689e9155)
1a 0b8ef978 6888a076 CoreCLR!QueueUserWorkItemManagedCallback(void * pArg = 0x0b8efbe1)+0xc5
1b (Inline) -------- CoreCLR!ManagedThreadBase_DispatchInner(void)+0x15 (Inline Function @ 6888a077)
1c 0b8efa28 6888b434 CoreCLR!ManagedThreadBase_DispatchMiddle(struct ManagedThreadCallState * pCallState = 0x0b8efb60)+0x67
1d (Inline) -------- CoreCLR!ManagedThreadBase_DispatchOuter::__l11::__Body::Run::__l5::__Body::Run(void)+0x7 (Inline Function @ 6888b435)

in this case, the first "real" frame is System_Private_CoreLib!System.Threading.FastRandom.Next(<Win32 error 0n318>)+0x74, which looks like:

0:010> !u -gcinfo 0bc8423c 
Normal JIT generated code
System.Threading.FastRandom.Next(Int32)
Begin 0BC841C8, size 7a

C:\gh\runtime2\src\libraries\System.Private.CoreLib\src\System\Threading\ThreadPool.cs @ 695:
Prolog size: 0
Security object: <none>
GS cookie: <none>
PSPSym: <none>
Generics inst context: <none>
PSP slot: <none>
GenericInst slot: <none>
Varargs: 0
Frame pointer: r11
Has tailcalls: 0
Size of parameter area: 0
Return Kind: Scalar
Code size: 7a
0bc841c8 e92d48f0 push        {r4-r7,r11,lr}
0bc841cc f10d0b10 add         r11,sp,#0x10
0bc841d0 4604     mov         r4,r0
0bc841d2 460d     mov         r5,r1
0bc841d4 f2420038 mov         r0,#0x2038
0bc841d8 f6c0009a movt        r0,#0x89A
0bc841dc 6806     ldr         r6,[r0]
0bc841de 4637     mov         r7,r6
0bc841e0 2d00     cmp         r5,#0
0bc841e2 dc15     bgt         0BC84210
0bc841e4 f2400018 movw        r0,#0x18
0bc841e8 f2c0707f movt        r0,#0x77F
0bc841ec f24061cd mov         r1,#0x6CD
0bc841f0 f241635d mov         r3,#0x165D
0bc841f4 f6c60381 movt        r3,#0x6881 (CoreCLR!JIT_GetSharedNonGCStaticBase_SingleAppDomain)
0bc841f8 de03     udf         #3
00000032 is a safepoint: 
00000031 +r7 +r6 +r4(interior)
0bc841fa f6413044 mov         r0,#0x1B44
0bc841fe f6c0009a movt        r0,#0x89A
0bc84202 6800     ldr         r0,[r0]
0bc84204 4639     mov         r1,r7
0bc84206 4632     mov         r2,r6
0bc84208 6803     ldr         r3,[r0]
0bc8420a 6adb     ldr         r3,[r3,#0x2C]
0bc8420c 691b     ldr         r3,[r3,#0x10]
0bc8420e de03     udf         #3

C:\gh\runtime2\src\libraries\System.Private.CoreLib\src\System\Threading\ThreadPool.cs @ 697:
00000048 is a safepoint: 
00000047 +r4(interior)
0bc84210 6860     ldr         r0,[r4,#4]
0bc84212 02c1     lsls        r1,r0,#0xB
0bc84214 4048     eors        r0,r0,r1

C:\gh\runtime2\src\libraries\System.Private.CoreLib\src\System\Threading\ThreadPool.cs @ 698:
0bc84216 68a1     ldr         r1,[r4,#8]
0bc84218 6061     str         r1,[r4,#4]
0bc8421a 68e1     ldr         r1,[r4,#0xC]
0bc8421c 60a1     str         r1,[r4,#8]
0bc8421e 6821     ldr         r1,[r4]
0bc84220 60e1     str         r1,[r4,#0xC]

C:\gh\runtime2\src\libraries\System.Private.CoreLib\src\System\Threading\ThreadPool.cs @ 699:
0bc84222 6821     ldr         r1,[r4]
0bc84224 0ccb     lsrs        r3,r1,#0x13
0bc84226 4059     eors        r1,r1,r3
0bc84228 4041     eors        r1,r1,r0
0bc8422a 0a00     lsrs        r0,r0,#8
0bc8422c 4048     eors        r0,r0,r1
0bc8422e 6020     str         r0,[r4]

C:\gh\runtime2\src\libraries\System.Private.CoreLib\src\System\Threading\ThreadPool.cs @ 701:
0bc84230 6820     ldr         r0,[r4]
0bc84232 4629     mov         r1,r5
0bc84234 f6410351 mov         r3,#0x1851
0bc84238 f6c6038a movt        r3,#0x688A (CoreCLR!JIT_UMod)
>>> 0bc8423c 4798     blx         r3
00000076 is a safepoint: 
0bc8423e e8bd88f0 pop         {r4-r7,r11,pc}

So, we're actually AT the last call instruction, in a non-interruptible region, not after the call instruction.

So, how do we get here? Is there some kind of initial stack walk setup that is getting confused on this call?

What's also interesting is that while we are running with COMPlus_GCStress=C, it looks like we are doing gcstress due to an allocation, not a stressed JIT/R2R instruction, based on the thread that seems to be the one to be triggering the GC

[edit: it looks like when concurrent GC is "forced" (I can't tell why it would be considered "forced" in this case), EEConfig::sync() adds in GCStress 3].

00 053bd210 6881889e KERNELBASE!DebugBreak+0x8
01 053bd218 6896bc00 CoreCLR!DbgAssertDialog(char * szFile = 0x68ca1b40 "C:\gh\runtime2\src\coreclr\src\vm\eetwain.cpp", int iLine = 0n5044, char * szExpr = 0x68ca1b9c "_safePointDecoder.IsSafePoint(curOffs)")+0xcb
02 053bd280 689d0670 CoreCLR!EECodeManager::EnumGcRefs(struct REGDISPLAY * pRD = 0x053bd9a0, class EECodeInfo * pCodeInfo = 0x053bd724, unsigned int flags = 1, <function> * pCallBack = 0x689d03e1, void * hCallBack = 0x053bdde8, unsigned long relOffsetOverride = 0xffffffff)+0xa1
03 053bd410 68870b1c CoreCLR!GcStackCrawlCallBack(class CrawlFrame * pCF = 0x053bd508, void * pData = 0x053bdde8)+0x1c1
04 053bd4d0 68871a50 CoreCLR!Thread::MakeStackwalkerCallback(class CrawlFrame * pCF = 0x053bd508, <function> * pCallback = <Value unavailable error>, void * pData = 0x053bdde8, unsigned int uFramesProcessed = 2)+0x41
05 053bd4f8 68871908 CoreCLR!Thread::StackWalkFramesEx(struct REGDISPLAY * pRD = <Value unavailable error>, <function> * pCallback = 0x689d04b1, void * pData = 0x053bdde8, unsigned int flags = 0x8500, class Frame * pStartFrame = 0x00000000)+0xd5
06 053bd7f8 68a69b7a CoreCLR!Thread::StackWalkFrames(<function> * pCallback = 0x689d04b1, void * pData = 0x053bdde8, unsigned int flags = 0x8500, class Frame * pStartFrame = 0x00000000)+0xe1
07 053bddd8 68a6919c CoreCLR!ScanStackRoots(class Thread * pThread = 0x05731660, <function> * fn = 0x68a1e7f1, struct ScanContext * sc = 0x053bde68)+0x11f
08 053bde18 68a31a56 CoreCLR!GCToEEInterface::GcScanRoots(<function> * fn = 0x68a1e7f1, int condemned = <Value unavailable error>, int max_gen = <Value unavailable error>, struct ScanContext * sc = 0x053bde68)+0x109
09 053bde50 68a2c0a2 CoreCLR!WKS::gc_heap::mark_phase(int condemned_gen_number = 0n1, int mark_only_p = <Value unavailable error>)+0x167
0a 053bdea8 68a2be6c CoreCLR!WKS::gc_heap::gc1(void)+0x15b
0b 053bded0 68a1d9e8 CoreCLR!WKS::gc_heap::garbage_collect(int n = 0n2)+0x2e5
0c 053bdf00 68a1f660 CoreCLR!WKS::GCHeap::GarbageCollectGeneration(unsigned int gen = 2, gc_reason reason = reason_gcstress (0n8))+0x241
0d (Inline) -------- CoreCLR!WKS::GCHeap::GarbageCollectTry(void)+0x17 (Inline Function @ 68a1f661)
0e 053bdf38 68907870 CoreCLR!WKS::GCHeap::StressHeap(struct gc_alloc_context * context = <Value unavailable error>)+0x371
0f (Inline) -------- CoreCLR!_GCStress::StressGcTriggerPolicy::Trigger(void)+0x29 (Inline Function @ 68907871)
10 (Inline) -------- CoreCLR!_GCStress::GCSBase<1,_GCStress::IgnoreFastGcSPolicy,_GCStress::AnyGcModePolicy,_GCStress::StressGcTriggerPolicy>::MaybeTrigger(void)+0x4b (Inline Function @ 68907871)
11 (Inline) -------- CoreCLR!_GCStress::GCStress<10,mpl::null_type,mpl::null_type,mpl::null_type>::MaybeTrigger(void)+0x4b (Inline Function @ 68907871)
12 053bdf98 6890849c CoreCLR!Alloc(unsigned int size = 0x12, GC_ALLOC_FLAGS flags = GC_ALLOC_NO_FLAGS (0n0))+0x95
13 053bdfc0 688a2aaa CoreCLR!AllocateSzArray(class MethodTable * pArrayMT = 0x0b1e78b0, int cElements = 0n6, GC_ALLOC_FLAGS flags = GC_ALLOC_NO_FLAGS (0n0))+0x16d
14 053be000 0b2787ac CoreCLR!JIT_NewArr1(struct CORINFO_CLASS_STRUCT_ * arrayMT = 0x0b1e78b0, int size = 0n6)+0xdb
15 053be0d0 6881104c GitHub_27924!Program.Main(<Win32 error 0n318>)+0x198
16 053be0f8 6891e564 CoreCLR!CallDescrWorkerInternal(void)+0x45
17 053be108 6891e9ae CoreCLR!CallDescrWorker(struct CallDescrData * pCallDescrData = 0x053be608)+0x99
18 053be530 6891bafa CoreCLR!MethodDescCallSite::CallTargetWorker(unsigned int64 * pArguments = 0x053be688, unsigned int64 * pReturnValue = <Value unavailable error>, int cbReturnValue = <Value unavailable error>)+0x35b
19 (Inline) -------- CoreCLR!MethodDescCallSite::Call_RetArgSlot(void)+0x41 (Inline Function @ 6891bafb)
1a 053be668 6891b858 CoreCLR!RunMainInternal(struct Param * pParam = 0x053be798)+0x193
1b (Inline) -------- CoreCLR!RunMain::__l30::__Body::Run::__l5::__Body::Run(void)+0x7 (Inline Function @ 6891b859)
1c 053be760 6891b8fa CoreCLR!`RunMain'::`30'::__Body::Run(struct Param * __EXparam = 0x053be798)+0x2d
1d 053be788 6891a72a CoreCLR!RunMain(class MethodDesc * pFD = 0x0b1e181c, short numSkipArgs = 0n1, int * piRetVal = 0x053be7e8, class REF<PtrArray> * stringArgs = 0x053beaf4)+0x8f

Other information at this point:

0:000> ?? _safePointDecoder
class GcInfoDecoder
   +0x000 m_Reader         : BitStreamReader
   +0x010 m_InstructionOffset : 0
   +0x014 m_IsInterruptible : 0
   +0x015 m_IsVarArg       : 0
   +0x016 m_GenericSecretParamIsMD : 0
   +0x017 m_GenericSecretParamIsMT : 0
   +0x018 m_HasTailCalls   : 0
   +0x01c m_SecurityObjectStackSlot : 0n-1
   +0x020 m_GSCookieStackSlot : 0n-1
   +0x024 m_ReversePInvokeFrameStackSlot : 0n-1
   +0x028 m_ValidRangeStart : 0
   +0x02c m_ValidRangeEnd  : 0
   +0x030 m_PSPSymStackSlot : 0n-1
   +0x034 m_GenericsInstContextStackSlot : 0n-1
   +0x038 m_CodeLength     : 0x7a
   +0x03c m_StackBaseRegister : 0xb
   +0x040 m_SizeOfEditAndContinuePreservedArea : 0xffffffff
   +0x044 m_ReturnKind     : 0 ( RT_Scalar )
   +0x048 m_NumSafePoints  : 3
   +0x04c m_SafePointIndex : 1
   +0x050 m_NumInterruptibleRanges : 0
   +0x054 m_SizeOfStackOutgoingAndScratchArea : 0
   +0x058 m_Flags          : 0 ( DECODE_EVERYTHING )
   +0x05c m_GcInfoAddress  : 0x0b508270  "???"
   +0x060 m_Version        : 2

@janvorli @AndyAyersMS Any suggestions for what to look at next?

@BruceForstall
Copy link
Member

There's something strange about the EE suspension. Through multiple GCs, there are 3 threads waiting for System.Threading.FastRandom.Next(Int32) to JIT; GC stack walks show ExplicitFrame in the PrestubMethodFrame. Then, in the next, failing case, all these three are sitting on the same instruction inside Next(), which is on a call instruction and not a GC safe point.

1f4c  20.646296354 : `SYNC`               In RedirectedHandledJITcase reason 0x3 pFrame = 0B4EEEE4 pc = 0BC8423D sp = 0B4EEF50 fp = 0B4EEF60
2ad8  20.646248281 : `SYNC`               In RedirectedHandledJITcase reason 0x3 pFrame = 0B72F0F4 pc = 0BC8423D sp = 0B72F160 fp = 0B72F170
28d0  20.646222240 : `SYNC`               In RedirectedHandledJITcase reason 0x3 pFrame = 0B8EF2F4 pc = 0BC8423D sp = 0B8EF360 fp = 0B8EF370

Two of these continue and return to the caller, a safe point, before suspension ends. The third, TID 28d0, does not, so we end up suspended but not at a safe point.

StressLog_top.txt

@janvorli
Copy link
Member

I can see that GC stress mode C injects the special invalid instructions at call sites too, but it should be detected as a special case (the instruction is different for call sites).

Can you see any local variable values at the following frame? I would like to see if it was correctly detected as a call site (atCall being true) - it seems it would not be the case. If we cannot get any details from the dump, we might need to add some stresslog logging here to give us more details, or alternatively try to repro the problem in debug build

0f 0b8ef228 6888ffc6 CoreCLR!DoGcStress(struct _CONTEXT * regs = <Value unavailable error>, class NativeCodeVersion * nativeCodeVersion = <Value unavailable error>)+0x251

I was wondering if by some mistake, the GCCoverageInfo::SprinkleBreakpoints doesn't put in an incorrect invalid instruction that would indicate it is a safe point instead of a call.

Looking at the code in replaceSafePointInstructionWithGcStressInstr that gets called from the GCCoverageInfo::SprinkleBreakpoints, it seems it should work correctly for our call instruction (instr code 0x4798):

// call by register instruction is two bytes (BL<c> Reg T1 encoding)
WORD instr = *((WORD*)savedInstrPtr - 1);
instr = instr & 0xff87;
if((instr ^ 0x4780) == 0)
// It is call by register
instructionIsACallThroughRegister = TRUE;
// call using immediate instructions are 4 bytes (BL<c> <label> T1 encoding)
instr = *((WORD*)savedInstrPtr - 2);
instr = instr & 0xf800;
if((instr ^ 0xf000) == 0)
if((*(((WORD*)savedInstrPtr)-1) & 0xd000) == 0xd000)
// It is call by immediate
instructionIsACallThroughImmediate = TRUE;

Although there seems to actually be a bug in this code. When it detects a call through immediate, it sets instructionIsACallThroughRegister instead of instructionIsACallThroughImmediate. But this should not cause the issue we are seeing.
We also really need to do the 2nd check (for calls through immediate) only if the first check for calls through register fails, but that's again not a bug.

@BruceForstall
Copy link
Member

Unfortunately, I lost the above repro and didn't create a dump. I tried repro'ing again with more verbose stress log settings, but it didn't seem to repro, so I kept dialing back the settings and after many hours I was able to get another repro. Similar, but not identical, this time 2 threads starting at the same location, and one gets to a post-call safe point but the other doesn't. I saved a minidump this time. However, (almost) no locals in DoGcStress() function.

It sounds like one of your theories is that multiple threads are on the same gcstress injected illegal call instruction, and one doesn't get replaced properly when resuming or replacing that instruction?

I was wondering if there was possibly a non-gcstress related gc suspension issue, but it would make more sense if the issue was gcstress only.

@janvorli
Copy link
Member

It sounds like one of your theories is that multiple threads are on the same gcstress injected illegal call instruction, and one doesn't get replaced properly when resuming or replacing that instruction?

What I really meant is that if the call stack is not misleading and the DoGcStress was really called for PC at the call, then it seems that the invalid instruction that was put at the call was not INTERRUPT_INSTR_CALL or INTERRUPT_INSTR_CALL_32, but INTERRUPT_INSTR or INTERRUPT_INSTR_32. Otherwise the DoGcStress would detect it was the call related one, set atCall=true and thus it would never reach the GCHeap::StressHeap call.

@v-haren
Copy link
Author

v-haren commented Jul 1, 2020

failed again in job: runtime-coreclr jitstress 20200629.1

failed test: JIT\Regression\JitBlue\GitHub_27924\GitHub_27924\GitHub_27924.cmd

Error message

Assert failure(PID 3384 [0x00000d38], Thread: 3564 [0x0dec]): _safePointDecoder.IsSafePoint(curOffs)

<no module>! <no symbol> + 0x0 (0x00000000)
 File: F:\workspace\_work\1\s\src\coreclr\src\vm\eetwain.cpp Line: 5044
 Image: D:\h\w\A922090B\p\CoreRun.exe


Return code: 1
Raw output file: D:\h\w\A922090B\w\9C2E08BC\e\JIT\Regression\Reports\JIT.Regression\JitBlue\GitHub_27924\GitHub_27924\GitHub_27924.output.txt
Raw output:
BEGIN EXECUTION
 "D:\h\w\A922090B\p\corerun.exe" GitHub_27924.dll 
Expected: 100
Actual: -1073740286
END EXECUTION - FAILED
FAILED
Test Harness Exitcode is : 1
To run the test:
> set CORE_ROOT=D:\h\w\A922090B\p
> D:\h\w\A922090B\w\9C2E08BC\e\JIT\Regression\JitBlue\GitHub_27924\GitHub_27924\GitHub_27924.cmd
Expected: True
Actual: False


Stack trace
   at JIT_Regression._JitBlue_GitHub_27924_GitHub_27924_GitHub_27924_._JitBlue_GitHub_27924_GitHub_27924_GitHub_27924_cmd() in F:\workspace\_work\1\s\artifacts\tests\coreclr\Windows_NT.arm.Checked\TestWrappers\JIT.Regression\JIT.Regression.XUnitWrapper.cs:line 122360

@safern safern added the blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' label Jul 2, 2020
BruceForstall added a commit to BruceForstall/runtime that referenced this issue Jul 7, 2020
@AndyAyersMS
Copy link
Member

@dotnet/jit-contrib we need somebody to look at this.

@BruceForstall
Copy link
Member

It looks to me like replaceSafePointInstructionWithGcStressInstr, called to insert breakpoints in GCCoverageInfo::SprinkleBreakpoints as the callback to GcInfoDecoder::EnumerateSafePoints, has a bug for arm32 (in addition to what Jan already mentioned above): it takes a pointer to an instruction, subtracts 2, reads 2 bytes, and checks to see if this is the encoding for a 2-byte call via register (bl reg). This isn't safe, because the previous instruction could be a 4-byte instruction whose final 2 bytes just happen to match this pattern, such as a "movw". Admittedly, the chance this happens is very low. Technically, we also shouldn't look backwards for offset zero, but we never generate a call as the very first instruction in a function, so that isn't a real problem.

@AndyAyersMS
Copy link
Member

Seems like we should re-enable this in a draft PR and see if we can catch some more failure instances? Or, I can try and repro this locally, though it might be a few days before I can get around to it.

@BruceForstall
Copy link
Member

I can relatively easily reproduce this locally (looking now).

@AndyAyersMS
Copy link
Member

Seeing if I can repro on Linux arm32.

@AndyAyersMS
Copy link
Member

No luck after several thousand iterations.

@BruceForstall you mentioned altering the test to make the failures repro more readily?

@BruceForstall
Copy link
Member

I simply upped the loops, from 1000000 to 100_000_000, and 10000 to 1_000_000. I presume you're already running the COMPlus_GCStress=C.

I'm also running with:

COMPlus_TieredCompilation=0
COMPlus_ReadyToRun=0
COMPlus_ZapDisable=1

which might all be required.

@BruceForstall
Copy link
Member

I found a race condition in the GCStress infrastructure which I believe causes this problem. The details are related to what Jan suggested above about a problem with atCall, as follows:

Consider the case of a partially interruptible function where we've replaced each call site with INTERRUPT_INSTR_CALL or INTERRUPT_INSTR_CALL_32. Even though the call site itself isn't a safe point, we put a GCStress instruction there so we can capture the call target at GC time, then set a specific GCStress instruction for the next instruction that will indicate how to GC protect any GC refs that are returned from the call.

Multiple threads can enter DoGcStress at the same time. They all get just past the race check if (!IsGcCoverageInterruptInstruction(instrPtr)) return;. One goes ahead, and reads the instruction code for the address at which we're doing a GCStress. We set atCall to true if this matches one of our known GCStress instructions for partially-interruptible call sites. Later, if atCall is true, we write the original call instruction back to the code stream, write the distinguished next instruction GCStress instruction, and continue. If the other threads then read at the instruction address, they might read the actual call opcode, not the distinguished GCStress breakpoint instruction code. Then, they will set atCall to false. This will indicate that we are in a fully-interruptible location, or that no special call-site behavior is required. We go on, force a GC, and eventually get to this assert in EECodeManager::EnumGcRefs that we are at a GC safe point.

It seems like this should be an all-platform issue, so I'm not sure why we only see it on Windows arm32 (except that it's very rare, and this particular test that exposes it (rarely) is run with GCStress much more frequently on Windows than Linux, and arm32 is probably a lot slower and potentially exposes the race more as a result).

It seems like the solution should be to simply move the if (!IsGcCoverageInterruptInstruction(instrPtr)) check down below the read of *instrPtr to set atCall.

@AndyAyersMS
Copy link
Member

AndyAyersMS commented Aug 28, 2020

No luck after several thousand iterations.

I couldn't repro with upped loop counts and also with some sideloaded stress.

I'm not sure why we only see it on Windows arm32

Note for arm32 we make multiple reads of instrPtr so perhaps the race window is much larger? It also means the fix is a perhaps bit more complex; we need to be careful to make just one read, then (perhaps?) use interlocked ops to try and update, so there's just one "winner".

@BruceForstall
Copy link
Member

Note for arm32 we make multiple reads of instrPtr so perhaps the race window is much larger?

The reason we read multiple times is to determine if we've got a 2-byte or a 4-byte instruction encoding, using GetARMInstructionLength. I don't think there is any other case of reading more than once. However, I don't think that matters because we always replace a 2-byte instruction with a 2-byte instruction and a 4-byte instruction with a 4-byte instruction, so the result of calling GetARMInstructionLength shouldn't change depending on whether we're reading the original or a new instruction. Perhaps there is some effect on the race window simply by reading twice.

@BruceForstall
Copy link
Member

With my fix of moving the (!IsGcCoverageInterruptInstruction(instrPtr)) check, I've run >500 iterations without a failure. Before, it failed consistently in the first 10 or 20 iterations.

@AndyAyersMS
Copy link
Member

I suppose we won't ever see torn reads so you are probably right about re-reads being fairly harmless.

I've run >500 iterations without a failure

That's great news. Thanks for digging into this one.

@BruceForstall
Copy link
Member

There definitely are assumptions in the GCStress code about not seeing torn reads, and that any *instrPtr read will be atomic. I'm not sure how valid these assumptions are, and it does seem that switching to use interlocked operations might be safer, although perhaps with some significant performance penalty.

BruceForstall added a commit to BruceForstall/runtime that referenced this issue Aug 28, 2020
Consider the case of a partially interruptible function where we've replaced
each call site with INTERRUPT_INSTR_CALL or INTERRUPT_INSTR_CALL_32. Even
though the call site itself isn't a safe point, we put a GCStress instruction
there so we can capture the call target at GC stress time, then set a specific
GCStress instruction for the next instruction that will indicate how to GC
protect any GC refs that are returned from the call.

Multiple threads can enter `DoGcStress` at the same time. They all get just
past the race check `if (!IsGcCoverageInterruptInstruction(instrPtr)) return;`.
One goes ahead, and reads the instruction code for the address at which we're
doing a GCStress. We set `atCall` to `true` if this matches one of our known
GCStress instructions for partially-interruptible call sites. Later, if
`atCall` is `true`, we write the original call instruction back to the code
stream, write the distinguished next instruction GCStress instruction, and
continue. If the other threads then read at the instruction address, they
might read the actual call opcode, not the distinguished GCStress breakpoint
instruction code. Then, they will set `atCall` to `false`. This will indicate
that we are in a fully-interruptible location, or that no special call-site
behavior is required. We go on, force a GC, and eventually get to the assert
in `EECodeManager::EnumGcRefs` that we are at a GC safe point.

The solution is to simply move the `if (!IsGcCoverageInterruptInstruction(instrPtr))`
check below the read of `*instrPtr` to set `atCall`.

Re-enable the GitHub_27924 test.

Fixes dotnet#36681
@BruceForstall
Copy link
Member

Since this is GCStress only, moving to 6.0.

@BruceForstall BruceForstall modified the milestones: 5.0.0, 6.0.0 Aug 28, 2020
BruceForstall added a commit that referenced this issue Aug 29, 2020
Consider the case of a partially interruptible function where we've replaced
each call site with INTERRUPT_INSTR_CALL or INTERRUPT_INSTR_CALL_32. Even
though the call site itself isn't a safe point, we put a GCStress instruction
there so we can capture the call target at GC stress time, then set a specific
GCStress instruction for the next instruction that will indicate how to GC
protect any GC refs that are returned from the call.

Multiple threads can enter `DoGcStress` at the same time. They all get just
past the race check `if (!IsGcCoverageInterruptInstruction(instrPtr)) return;`.
One goes ahead, and reads the instruction code for the address at which we're
doing a GCStress. We set `atCall` to `true` if this matches one of our known
GCStress instructions for partially-interruptible call sites. Later, if
`atCall` is `true`, we write the original call instruction back to the code
stream, write the distinguished next instruction GCStress instruction, and
continue. If the other threads then read at the instruction address, they
might read the actual call opcode, not the distinguished GCStress breakpoint
instruction code. Then, they will set `atCall` to `false`. This will indicate
that we are in a fully-interruptible location, or that no special call-site
behavior is required. We go on, force a GC, and eventually get to the assert
in `EECodeManager::EnumGcRefs` that we are at a GC safe point.

The solution is to simply move the `if (!IsGcCoverageInterruptInstruction(instrPtr))`
check below the read of `*instrPtr` to set `atCall`.

Re-enable the GitHub_27924 test.

Fixes #36681
@ghost ghost locked as resolved and limited conversation to collaborators Dec 9, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
arch-arm32 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' JitStress CLR JIT issues involving JIT internal stress modes os-windows
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants