Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT/Performance/CodeQuality/Serialization/Deserialize/Deserialize.dll fails under gcstress on linux-arm64 #88618

Closed
AndyAyersMS opened this issue Jul 10, 2023 · 16 comments · Fixed by #89678
Assignees
Labels
arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI blocking-clean-ci-optional Blocking optional rolling runs disabled-test The test is disabled in source code against the issue GCStress Known Build Error Use this to report build issues in the .NET Helix tab os-linux Linux OS (any supported distro)
Milestone

Comments

@AndyAyersMS
Copy link
Member

AndyAyersMS commented Jul 10, 2023

https://dev.azure.com/dnceng-public/public/_build/results?buildId=333077&view=ms.vss-test-web.build-test-results-tab

Console log: https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-heads-main-db9631cc71ea487795/JIT.performance.0.1/1/console.04cf4a49.log?helixlogtype=result


export DOTNET_TieredCompilation=0
export DOTNET_DbgEnableMiniDump=1
export DOTNET_EnableCrashReport=1
export DOTNET_DbgMiniDumpName=$HELIX_DUMP_FOLDER/coredump.%d.dmp
export DOTNET_GCStress=0xC
...

15:31:17.857 Running test: JIT/Performance/CodeQuality/Serialization/Deserialize/Deserialize.dll
[createdump] Gathering state for process 23 corerun
[createdump] Crashing thread 0017 signal 11 (000b)

{
  "ErrorMessage": "Running test: JIT/Performance/CodeQuality/Serialization/Deserialize/Deserialize.dll\\n[createdump]",
  "BuildRetry": false,
  "ErrorPattern": "",
  "ExcludeConsoleLog": false
}

Known issue validation

Build: 🔎 https://dev.azure.com/dnceng-public/public/_build/results?buildId=333077
Error message validated: Running test: JIT/Performance/CodeQuality/Serialization/Deserialize/Deserialize.dll\n[createdump]
Result validation: ❌ Known issue did not match with the provided build.
Validation performed at: 7/18/2023 9:44:11 AM UTC

Report

Summary

24-Hour Hit Count 7-Day Hit Count 1-Month Count
0 0 0
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jul 10, 2023
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Jul 10, 2023
@ghost
Copy link

ghost commented Jul 10, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

https://dev.azure.com/dnceng-public/public/_build/results?buildId=333180&view=ms.vss-test-web.build-test-results-tab


export DOTNET_TieredCompilation=0
export DOTNET_DbgEnableMiniDump=1
export DOTNET_EnableCrashReport=1
export DOTNET_DbgMiniDumpName=$HELIX_DUMP_FOLDER/coredump.%d.dmp
export DOTNET_GCStress=0xC
export DOTNET_JitForceControlFlowGuard=1

+ export TEST_HARNESS_STRIPE_TO_EXECUTE=.0.1
+ chmod +x JIT/Performance/JIT.performance/JIT.performance.sh
+ JIT/Performance/JIT.performance/JIT.performance.sh -usewatc

...

23:34:45.767 Running test: JIT/Performance/CodeQuality/Serialization/Deserialize/Deserialize.dll
[createdump] Gathering state for process 23 corerun
[createdump] Crashing thread 0017 signal 11 (000b)

Author: AndyAyersMS
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: -

@AndyAyersMS AndyAyersMS added arch-arm64 os-linux Linux OS (any supported distro) GCStress and removed untriaged New issue has not been triaged by the area owner labels Jul 10, 2023
@JulieLeeMSFT JulieLeeMSFT added this to the 8.0.0 milestone Jul 14, 2023
@JulieLeeMSFT
Copy link
Member

@jakobbotsch, PTAL.

@jakobbotsch jakobbotsch changed the title jit-cfg pipline gc stress failure on linux x64 JIT/Performance/CodeQuality/Serialization/Deserialize/Deserialize.dll fails under gcstress on linux-arm64 Jul 18, 2023
@jakobbotsch jakobbotsch added blocking-clean-ci-optional Blocking optional rolling runs Known Build Error Use this to report build issues in the .NET Helix tab labels Jul 18, 2023
@markples
Copy link
Member

The pattern ("ErrorPattern": "Running test: JIT/Performance/CodeQuality/Serialization/Deserialize/Deserialize.dll.{1,2}\\[createdump\\]") isn't working. Attempting to guess intent, I think the .{1,2} is trying to match \n or \r\n so that it matches on Windows or Linux? ErrorPattern appears to be single-line, and ErrorMessage isn't regex, so I'm not sure how to do this. However, in this case, it appears to be Linux-only, so perhaps \n will suffice (not . since single-line regex). I'll see if I can get a \n past the escaping.

@jakobbotsch
Copy link
Member

The pattern ("ErrorPattern": "Running test: JIT/Performance/CodeQuality/Serialization/Deserialize/Deserialize.dll.{1,2}\\[createdump\\]") isn't working. Attempting to guess intent, I think the .{1,2} is trying to match \n or \r\n so that it matches on Windows or Linux? ErrorPattern appears to be single-line, and ErrorMessage isn't regex, so I'm not sure how to do this. However, in this case, it appears to be Linux-only, so perhaps \n will suffice (not . since single-line regex). I'll see if I can get a \n past the escaping.

Right. I first tried with \s+, but that also didn't work. I'm not sure why, since I tested the regex according to the instructions at https://github.com/dotnet/runtime/blob/main/docs/workflow/ci/failure-analysis.md#what-to-do-if-you-determine-the-failure-is-unrelated and it worked.
.{1, 2} was my second attempt.

@markples
Copy link
Member

I don't think ErrorPattern is going to work because a single-line match is performed. \n as an ErrorMessage didn't work - maybe \\n would? There's a level of json escaping (so \n -> actual newline and \\n -> \n?), but I'm not sure what happens next with the string. I see we're fighting over the description now :) so I'll back off.

@jakobbotsch
Copy link
Member

I don't think ErrorPattern is going to work because a single-line match is performed. \n as an ErrorMessage didn't work - maybe \\n would? There's a level of json escaping (so \n -> actual newline and \\n -> \n?), but I'm not sure what happens next with the string. I see we're fighting over the description now :) so I'll back off.

The single-line regex option doesn't mean you cannot match multiple lines, it just affects the behavior of how newlines are matched (https://learn.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-options#single-line-mode). So I would still expect us to be able to match both lines with a pattern.

Sorry, I didn't realize you were trying something else, I just noticed the \n didn't work. I would expect that to work in ErrorMessage, so maybe it does need to be \r\n? I can try these suggestions.

@markples
Copy link
Member

Of course you're right, so now I agree with you in not understanding why it isn't working. I guess I still don't know how the string is actually processed (first there's JSON decoding, and that string appears to be displayed in the "Known issue validation" box, but then is some C# dynamically compiled, Console.ReadLine, something else?) and if that has some special requirement.

@markples
Copy link
Member

This is crashing the gcstress run, so we should disable the test anyways.

19/100 tests run.

Though it occurs to me that I don't know how to disable a gcstress test on a particular architecture. I see some <GCStressIncompatible Condition="'$(TargetArchitecture)' == in the tree, but I don't think that works because tests are built under anycpu.

@markples
Copy link
Member

See #89072 but feel free to close it and do something else

@jakobbotsch
Copy link
Member

Thanks @markples. Sounds good to me to disable it.


On topic of the actual issue here: this looks related to #87082. We are running GC during the inlined TLS access sequence when we hit an invalid GC ref:

(lldb) clru -gcinfo 0x0000ffffb8826044
Normal JIT generated code
System.Runtime.Serialization.DataContracts.DataContract+DataContractCriticalHelper.GetId(System.RuntimeTypeHandle)
ilAddr is 0000FFBF8B592764 pImport is 0000000040DA4390
Begin 0000FFFFB8826000, size 484

/runtime/src/libraries/System.Private.DataContractSerialization/src/System/Runtime/Serialization/DataContract.cs @ 395:
Prolog size: 0
Security object: <none>
GS cookie: <none>
PSPSym: caller.sp-30
Generics inst context: <none>
PSP slot: caller.sp-30
GenericInst slot: <none>
Varargs: 0
Frame pointer: Fp
Has tailcalls: 0
Size of parameter area: 0
Return Kind: Scalar
Code size: 484
0000ffffb8826000 fd7bbaa9             stp     x29, x30, [sp, #-0x60]!
0000ffffb8826004 f3d303a9             stp     x19, x20, [sp, #0x38]
0000ffffb8826008 f5db04a9             stp     x21, x22, [sp, #0x48]
0000ffffb882600c f72f00f9             str     x23, [sp, #0x58]
0000ffffb8826010 fd030091             mov     x29, sp
0000ffffb8826014 bf0b00f9             str     xzr, [x29, #0x10]
0000ffffb8826018 e1830191             add     x1, sp, #0x60
0000ffffb882601c a11b00f9             str     x1, [x29, #0x30]
0000ffffb8826020 f30300aa             mov     x19, x0
00000024 interruptible
00000024 +Fp+10 +X19
0000ffffb8826024 e00313aa             mov     x0, x19
00000028 +X0
0000ffffb8826028 01868fd2             mov     x1, #0x7c30
0000ffffb882602c e12eb7f2             movk    x1, #0xb977, lsl #16
0000ffffb8826030 e1ffdff2             movk    x1, #0xffff, lsl #32
0000ffffb8826034 210040f9             ldr     x1, [x1]
0000ffffb8826038 20003fd6             blr     x1
0000003c -X19
0000ffffb882603c f30300aa             mov     x19, x0

/runtime/src/libraries/System.Private.DataContractSerialization/src/System/Runtime/Serialization/DataContract.cs @ 396:
00000040 +X19
0000ffffb8826040 40d03bd5             mrs     x0, TPIDR_EL0
>>> 0000ffffb8826044 01f840b9             ldr     w1, [x0, #0xf8]

We read TPIDR_EL0 into x0, but x0 is marked as containing a GC pointer here and there is no update in GC reporting to unmark that.

cc @kunalspathak

@jakobbotsch
Copy link
Member

JitDisasm:

G_M55734_IG01:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG
            stp     fp, lr, [sp, #-0x60]!
            stp     x19, x20, [sp, #0x38]
            stp     x21, x22, [sp, #0x48]
            str     x23, [sp, #0x58]
            mov     fp, sp
            str     xzr, [fp, #0x10]    // [V08 loc7]
            add     x1, sp, #96
            str     x1, [fp, #0x30]     // [V37 PSPSym]
            mov     x19, x0
                             ; gcrRegs +[x19]
                                                ;; size=36 bbWeight=1 PerfScore 7.50
G_M55734_IG02:        ; bbWeight=1, gcVars=0000000100000000 {V08}, gcrefRegs=80000 {x19}, byrefRegs=0000 {}, gcvars, byref, isz
                             ; GC ptr vars +{V08 V32}
            mov     x0, x19
                             ; gcrRegs +[x0]
            movz    x1, #0x7C30      // code for System.Runtime.Serialization.DataContracts.DataContract+DataContractCriticalHelper:GetDataContractAdapterTypeHandle(System.RuntimeTypeHandle):System.RuntimeTypeHandle
            movk    x1, #0x6081 LSL #16
            movk    x1, #0xFFFF LSL #32
            ldr     x1, [x1]
            blr     x1
                             ; gcrRegs -[x19]
                             ; gcr arg pop 0
            mov     x19, x0
                             ; gcrRegs +[x19]
            mrs     x0, tpidr_el0
            ldr     w1, [x0, #0xF8]
            cmp     w1, #3
            blt     G_M55734_IG20
            ldr     x0, [x0, #0x100]
                             ; gcrRegs -[x0]
            ldr     x0, [x0, #0x18]
            cbz     x0, G_M55734_IG20
            ldr     x0, [x0]
                             ; byrRegs +[x0]
            add     x20, x0, #16
                             ; byrRegs +[x20]
                                                ;; size=64 bbWeight=1 PerfScore 22.50

@kunalspathak Can you please take a look?

@jakobbotsch jakobbotsch added disabled-test The test is disabled in source code against the issue and removed blocking-clean-ci-optional Blocking optional rolling runs labels Jul 18, 2023
@jakobbotsch
Copy link
Member

Test disabled in #89072

@jakobbotsch
Copy link
Member

It looks like JIT/Performance/CodeQuality/Serialization/Serialize is failing in the same way now that we get past the Deserialize version, so this is still blocking.

@kunalspathak
Copy link
Member

We read TPIDR_EL0 into x0, but x0 is marked as containing a GC pointer here and there is no update in GC reporting to unmark that.

@jakobbotsch - tpidr_el0 is a system register containing current thread local statics pointer and should not be marked as GC pointer, so that is the problem. Perhaps, when I added mrs instruction, I didn't set the right flags to make sure that destination is not a GCref. But I don't see anything like ; gcrRegs +[x0] after mrs instruction. So wondering how you know that the code is thinking that x0 is holding a GC ref?

If you see the JITDump, it is marked as long:

Generating: N015 (  1,  2) [000018] Hc---------                   t18 =    CNS_INT(h) long   0 tls REG NA
                                                                        /--*  t18    long   
Generating: N017 (  5,  5) [000019] DA---------                         *  STORE_LCL_VAR long   V04 rat1          x14 REG x14
IN0006:             mrs     x14, tpidr_el0
							V04 in reg x14 is becoming live  [000019]
							Live regs: 80000 {x19} => 84000 {x14 x19}
				Live vars after [000019]: {V02} => {V02 V04}
Generating: N019 (  3,  2) [000022] -----------                   t22 =    LCL_VAR   long   V04 rat1          x14 REG x14
                                                                        /--*  t22    long   
Generating: N021 (  5,  5) [000023] -c---------                   t23 = *  LEA(b+248) long   REG NA
                                                                        /--*  t23    long   
Generating: N023 (  6,  4) [000024] #----------                   t24 = *  IND       int    REG x15
IN0007:             ldr     w15, [x14, #0xF8]

Also remind me, in the dump >>> means the GC was triggered at that instruction (ldr) or the previous instruction (mrs)?

@markples
Copy link
Member

Does the +[x0] carry over from the previous definition of x0? (up until the -next- definition which has a -[x0])

@jakobbotsch
Copy link
Member

So wondering how you know that the code is thinking that x0 is holding a GC ref?

Right, it's what Mark said -- you can see it was previously marked as holding a GC ref here:

            mov     x0, x19
                             ; gcrRegs +[x0]

GC reporting is not changed as a result of the call since the call returns a GC ref in x0:

            blr     x1
                             ; gcrRegs -[x19]
                             ; gcr arg pop 0
            mov     x19, x0
                             ; gcrRegs +[x19]

(if the call didn't return a GC ref we would expect x0 to be unmarked here).

Also remind me, in the dump >>> means the GC was triggered at that instruction (ldr) or the previous instruction (mrs)?

The GC was triggered at the ldr instruction. The >>> is displayed because I did clru with that address -- I extracted the address from the current GC related function that is on top of the stack with the crash happens.

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Jul 29, 2023
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jul 31, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Aug 30, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI blocking-clean-ci-optional Blocking optional rolling runs disabled-test The test is disabled in source code against the issue GCStress Known Build Error Use this to report build issues in the .NET Helix tab os-linux Linux OS (any supported distro)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants