Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arm32 Libraries tests failures with full PGO #53549

Closed
AndyAyersMS opened this issue Jun 1, 2021 · 11 comments · Fixed by #56126
Closed

Arm32 Libraries tests failures with full PGO #53549

AndyAyersMS opened this issue Jun 1, 2021 · 11 comments · Fixed by #56126
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Milestone

Comments

@AndyAyersMS
Copy link
Member

From https://dev.azure.com/dnceng/public/_build/results?buildId=1163507&view=ms.vss-test-web.build-test-results-tab,

fullpgo arm32 failures in a number of array enumerator tests:

    System.Tests.ArrayEnumeratorTests_int.IEnumerable_Generic_Enumerator_Reset_BeforeIteration_Support(count: 0) [FAIL]
      System.NullReferenceException : Object reference not set to an instance of an object.
      Stack Trace:
        /_/src/libraries/System.Runtime/tests/System/ArraySegmentTests.cs(44,0): at System.Tests.ArraySegment_Tests`1.Factory(Int32 count, Int32 offset, Int32 length)
        /_/src/libraries/System.Runtime/tests/System/ArraySegmentTests.cs(24,0): at System.Tests.ArraySegment_Tests`1.GenericIListFactory(Int32 count)
        /_/src/libraries/Common/tests/System/Collections/IList.Generic.Tests.cs(93,0): at System.Collections.Tests.IList_Generic_Tests`1.GenericICollectionFactory(Int32 count)
        /_/src/libraries/Common/tests/System/Collections/ICollection.Generic.Tests.cs(67,0): at System.Collections.Tests.ICollection_Generic_Tests`1.GenericIEnumerableFactory(Int32 count)
        /_/src/libraries/Common/tests/System/Collections/IEnumerable.Generic.Tests.cs(663,0): at System.Collections.Tests.IEnumerable_Generic_Tests`1.IEnumerable_Generic_Enumerator_Reset_BeforeIteration_Support(Int32 count)
@dotnet-issue-labeler dotnet-issue-labeler bot added area-System.Collections untriaged New issue has not been triaged by the area owner labels Jun 1, 2021
@ghost
Copy link

ghost commented Jun 1, 2021

Tagging subscribers to this area: @eiriktsarpalis
See info in area-owners.md if you want to be subscribed.

Issue Details

From https://dev.azure.com/dnceng/public/_build/results?buildId=1163507&view=ms.vss-test-web.build-test-results-tab,

fullpgo arm32 failures in a number of array enumerator tests:

    System.Tests.ArrayEnumeratorTests_int.IEnumerable_Generic_Enumerator_Reset_BeforeIteration_Support(count: 0) [FAIL]
      System.NullReferenceException : Object reference not set to an instance of an object.
      Stack Trace:
        /_/src/libraries/System.Runtime/tests/System/ArraySegmentTests.cs(44,0): at System.Tests.ArraySegment_Tests`1.Factory(Int32 count, Int32 offset, Int32 length)
        /_/src/libraries/System.Runtime/tests/System/ArraySegmentTests.cs(24,0): at System.Tests.ArraySegment_Tests`1.GenericIListFactory(Int32 count)
        /_/src/libraries/Common/tests/System/Collections/IList.Generic.Tests.cs(93,0): at System.Collections.Tests.IList_Generic_Tests`1.GenericICollectionFactory(Int32 count)
        /_/src/libraries/Common/tests/System/Collections/ICollection.Generic.Tests.cs(67,0): at System.Collections.Tests.ICollection_Generic_Tests`1.GenericIEnumerableFactory(Int32 count)
        /_/src/libraries/Common/tests/System/Collections/IEnumerable.Generic.Tests.cs(663,0): at System.Collections.Tests.IEnumerable_Generic_Tests`1.IEnumerable_Generic_Enumerator_Reset_BeforeIteration_Support(Int32 count)
Author: AndyAyersMS
Assignees: -
Labels:

area-System.Collections, untriaged

Milestone: -

@AndyAyersMS AndyAyersMS added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed area-System.Collections untriaged New issue has not been triaged by the area owner labels Jun 1, 2021
@AndyAyersMS AndyAyersMS self-assigned this Jun 1, 2021
@AndyAyersMS AndyAyersMS added this to the 6.0.0 milestone Jun 1, 2021
@AndyAyersMS
Copy link
Member Author

Also whenever I run on arm via a checked corerun I get a slew of unwanted output from the runtime, some sort of directory probing :

The specified managed assembly is not a file: /mnt/laptop/repos/runtime/artifacts/tests/coreclr/Linux.arm.Checked/Tests/Core_Root//.
The specified managed assembly is not a file: /mnt/laptop/repos/runtime/artifacts/tests/coreclr/Linux.arm.Checked/Tests/Core_Root//..
The specified managed assembly is not a file: /mnt/laptop/repos/runtime/artifacts/tests/coreclr/Linux.arm.Checked/Tests/Core_Root//crossgen2
The specified managed assembly is not a file: /mnt/laptop/repos/runtime/artifacts/tests/coreclr/Linux.arm.Checked/Tests/Core_Root//IL
The specified managed assembly is not a file: /mnt/laptop/repos/runtime/artifacts/tests/coreclr/Linux.arm.Checked/Tests/Core_Root//R2RDump
The specified managed assembly is not a file: /mnt/laptop/repos/runtime/artifacts/tests/coreclr/Linux.arm.Checked/Tests/Core_Root//xunit
The specified managed assembly is not a file: /mnt/laptop/repos/runtime/artifacts/tests/coreclr/Linux.arm.Checked/Tests/Core_Root//R2RTest
The specified managed assembly is not a file: /mnt/laptop/repos/runtime/artifacts/tests/coreclr/Linux.arm.Checked/Tests/Core_Root//PDB

@AndyAyersMS
Copy link
Member Author

Trying to repro but getting other random-looking failures, currently.

@AndyAyersMS
Copy link
Member Author

I can't repro the issues and from the test history it looks like these tests are not always failing. So will wait for this weekend's runs to see if any clearer pattern emerges.

@AndyAyersMS
Copy link
Member Author

Looking at the last weekend runs the sets of failures seem to fluctuate, but there's a common failure mode where an object is unexpectedly null. This seems to happen across all architectures (though more commonly for arm/arm64) so likely is some high level transformation.

See eg https://dev.azure.com/dnceng/public/_build/results?buildId=1173163&view=ms.vss-test-web.build-test-results-tab

@AndyAyersMS
Copy link
Member Author

@AndyAyersMS
Copy link
Member Author

Not sure if it's the same issue but I can get libraries tests to fail with PGO enabled if I add -verbose to the xunit args... the initial tests run fine but eventually you get a whole raft of failures, eg:

set COMPlus_TC_QuickJitForLoops=1
set COMPlus_TieredPGO=1

C:\repos\runtime0\artifacts\bin\testhost\net6.0-windows-Release-x64\dotnet.exe exec --runtimeconfig System.Data.Common.Tests.runtimeconfig.json --depsfile System.Data.Common.Tests.deps.json xunit.console.dll System.Data.Common.Tests.dll -xml testResults.xml -nologo -notrait category=OuterLoop -notrait category=failing -verbose
  Discovering: System.Data.Common.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.Data.Common.Tests (found 1811 of 1813 test cases)
  Starting:    System.Data.Common.Tests (parallel test collections = on, max threads = 12)
...
System.Data.Tests.DataColumnExpressionTest.BinaryOperator(operandType1: typeof(uint), operandType2: typeof(sbyte), resultType: typeof(byte), expression: "Operand1 + Operand2", operand1: 10, operand2: , result: ) [STARTING]
    [FATAL ERROR] System.NullReferenceException
      System.NullReferenceException : Object reference not set to an instance of an object.
    System.Data.Tests.DataColumnExpressionTest.BinaryOperator(operandType1: typeof(uint), operandType2: typeof(sbyte), resultType: typeof(byte), expression: "Operand1 - Operand2", operand1: 10, operand2: 2, result: 8) [STARTING]
    [FATAL ERROR] System.NullReferenceException
      System.NullReferenceException : Object reference not set to an instance of an object.
    System.Data.Tests.DataColumnExpressionTest.BinaryOperator(operandType1: typeof(uint), operandType2: typeof(sbyte), resultType: typeof(byte), expression: "Operand1 - Operand2", operand1: 10, operand2: , result: ) [STARTING]
    [FATAL ERROR] System.NullReferenceException
      System.NullReferenceException : Object reference not set to an instance of an object.
    System.Data.Tests.DataColumnExpressionTest.BinaryOperator(operandType1: typeof(uint), operandType2: typeof(sbyte), resultType: typeof(byte), expression: "Operand1 * Operand2", operand1: 10, operand2: 2, result: 20) [STARTING]
    [FATAL ERROR] System.NullReferenceException
      System.NullReferenceException : Object reference not set to an instance of an object.
    System.Data.Tests.DataColumnExpressionTest.BinaryOperator(operandType1: typeof(uint), operandType2: typeof(sbyte), resultType: typeof(byte), expression: "Operand1 * Operand2", operand1: 10, operand2: , result: ) [STARTING]

@AndyAyersMS
Copy link
Member Author

AndyAyersMS commented Jul 20, 2021

For the -verbose issue, when we jit VerboseReporterMessageHandler:<.ctor>b__0_1(MessageHandlerArgs`1):this (MethodHash=e788d62d) with PGO, we end up producing:

IN0016: 000091 vmovupd  xmmword ptr [0000H], xmm0

and not surprisingly this leads to an NRE at runtime.

@AndyAyersMS
Copy link
Member Author

Looks like this is a bug in the GDV expansion for a call returning a struct value that is the payload of a box. Normally we'd allocate the box object and then make the call, but with GDV we can end up making the call first.

More details forthcoming...

@AndyAyersMS
Copy link
Member Author

Here's a simple repro.

using System;
using System.Runtime.CompilerServices;
using System.Threading;

interface I
{
    public Decimal F();
}

class X : I
{
    Decimal z;

    public Decimal F() => z;

    public static bool G(object o) 
    {
        return ((decimal) o).Equals(100M);
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int H(I i)
    {
        return G(i.F()) ? 100 : -1;
    }

    public static int Main()
    {
        X x = new X();
        x.z = 100M;

        for (int i = 0; i < 100; i++)
        {
            _ = H(x);
            Thread.Sleep(15);
        }

        return H(x);
    }
}

When H is recompiled at Tier1, GDV kicks in, and we end up with bad code:

       48B8D0A2141AF97F0000 mov      rax, 0x7FF91A14A2D0     // gdv test for i
       483901               cmp      qword ptr [rcx], rax
       0F85BD000000         jne      G_M22385_IG13
       C5F9104108           vmovupd  xmm0, xmmword ptr [rcx+8]    // blown box init
       C5F911042500000000   vmovupd  xmmword ptr [0000H], xmm0

and hence an exception

Unhandled exception. System.NullReferenceException: Object reference not set to an instance of an object.
   at X.H(I i) in repro.exe:token 0x6000004+0x0
   at X.Main() in repro.exe:token 0x6000005+0x17

@AndyAyersMS
Copy link
Member Author

cc @EgorBo in case you're interested or have run across this ... still looking for a fix that does not require losing some GDV.

AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue Jul 22, 2021
If a call is a GDV candidate and returns a struct via hidden buffer, and that
return value is immediately boxed, the GDV expansion will produce IR in
incorrect order, leading to bad codegen.

This seems to be a rare enough sequence that disabling GDV is a reasonable
workaround for now.

Actually the box expansion is producing IR in the wrong order and GDV fails
to fix the order (unlike inlining, which does fix the order).

Longer term we should avoid producing out of order IR. But that seems a bit
more complicated and may have other CQ impact.

Added a test case.

Closes dotnet#53549.
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Jul 22, 2021
AndyAyersMS added a commit that referenced this issue Jul 22, 2021
If a call is a GDV candidate and returns a struct via hidden buffer, and that
return value is immediately boxed, the GDV expansion will produce IR in
incorrect order, leading to bad codegen.

This seems to be a rare enough sequence that disabling GDV is a reasonable
workaround for now.

Actually the box expansion is producing IR in the wrong order and GDV fails
to fix the order (unlike inlining, which does fix the order).

Longer term we should avoid producing out of order IR. But that seems a bit
more complicated and may have other CQ impact.

Added a test case.

Closes #53549.
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jul 22, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Aug 21, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

1 participant