Fix regression test 46239 with Crossgen2 and improve runtime logging #51416

trylek · 2021-04-16T22:07:18Z

The regression test

src\tests\JIT\Regressions\JitBlue\Runtime_46239

exercises various interesting corner cases of type layout that
weren't handled properly in Crossgen2 on x86 and ARM[32]. This
change fixes the remaining deficiencies and it also adds
provisions for better runtime logging upon type layout mismatches.

Thanks

Tomas

/cc @dotnet/crossgen-contrib

P.S. For now I don't expect we need to push this to the Preview 4 bits,
as the change covers a rather obscure corner case on the less prevalent
architectures; please let me know if you believe otherwise.

davidwrighton · 2021-04-16T23:22:08Z

src/coreclr/tools/Common/TypeSystem/Common/MetadataFieldLayoutAlgorithm.cs

@@ -304,17 +305,19 @@ protected virtual void FinalizeRuntimeSpecificStaticFieldLayout(TypeSystemContex
        {
        }

-        protected static ComputedInstanceFieldLayout ComputeExplicitFieldLayout(MetadataType type, int numInstanceFields)
+        protected virtual bool IsBlittableOrManagedSequential(TypeDesc type) => false;


I suspect the default here should be true. Blittable types are required to be handled the same way on all runtimes, as they describe native data structures, and defaulting to false opens the door for the algorithm to misbehave on Native AOT.

I'd prefer to see this either be default true, or make it an abstract method and force the NativeAOT compiler to deal with this when it gets there.

@MichalStrehovsky could you comment here on what you'd like to see.

Maybe let's just move IsManagedSequentialType to here and then this doesn't even need to be virtual and can just do the same thing?

We didn't have trouble with this in .NET Native (where this is missing), so I don't think it matters, but we ported over so many CoreCLR quirks to this algorithm over time that I no longer understand how layout is done, and I would prefer all bugs in it to also surface in crossgen2 (rather than having NativeAOT-specific bugs that we need to troubleshoot there).

@MichalStrehovsky - Thanks for your feedback. I'm trying to modify the change based on your suggestion, but I'm hitting a layering problem: IsBlittable is part of MarshalUtils, which is not in ILCompiler.TypeSystem.ReadyToRun, and moving it over would bring various bits of interop code into ILCompiler.TypeSystem.ReadyToRun. I suspect that is undesirable, as my current understanding is that NativeAOT doesn't use exactly the same interop as Crossgen2 / CoreCLR. What do you think would be the ideal way to resolve this discrepancy?

MarshalUtils are part of ILCompiler.TypeSystem on the NativeAOT side, so I'm not opposed to moving it (and things it depends on) into ILCompiler.TypeSystem.ReadyToRun as well. The split is rather arbitrary and interop is getting tangled into the type system the same way as is on CoreCLR. It was a nice dream to have a separation of concerns there.

https://github.com/dotnet/runtimelab/blob/c8e3158b0981419527b18f3c0259bcf02103e054/src/coreclr/tools/aot/ILCompiler.TypeSystem/ILCompiler.TypeSystem.csproj#L528-L530

The split is rather arbitrary and interop is getting tangled into the type system the same way as is on CoreCLR. It was a nice dream to have a separation of concerns there.

Would there be a way to adjust the runtime side of things to avoid this mess?

Well, for now I think I have at least managed to make something like an inventory of the various runtime behaviors by implementing them in Crossgen2. In theory we may be able to simplify some of it; however, I suspect that some of the subtle distinctions amount to tiny memory-use optimizations on the runtime side, i.e. something that's very tricky to undo, as it may incur working set regressions in arbitrary scenarios.

however suspect that some of the subtle distinctions amount to tiny memory use optimizations on the runtime side

I doubt that it would show up on the radar. The behavior of the field layout algorithm is completely accidental in a number of cases, and in fact there are many known situations where it is inefficient for no good reason.

Examples:

Improve Dictionary TryGetValue size/perfomance coreclr#27195 (comment)

Why Nullable<byte> is aligned to 8 bytes instead of 2 bytes as short does? #12977

You're probably the biggest expert in this field so I have no desire to doubt your assessment. My only point is that further independent development of runtime vs. Crossgen2 side of this equation is giving me goosebumps; my current expectation is that we'll ultimately go down the path you yourself suggested in the past - having a way to communicate the field layouts from the compiler to the runtime, that will give us room for experimentation with potential packing optimizations and rid us of the burdensome need to keep both codebases in 100% sync.

davidwrighton

Other than my comment on the IsBlittableOrManagedSequential, I'm pretty happy with this.

BruceForstall · 2021-04-20T22:07:17Z

It looks like this is causing every jitstress job flavor to fail:

https://dev.azure.com/dnceng/public/_build/results?buildId=1097089&view=ms.vss-test-web.build-test-results-tab&runId=33568136&resultId=111051&paneView=debug

BruceForstall · 2021-04-21T16:56:45Z

Any chance you'll get this in soon? There are a lot of test jobs failing due to test49826.

jkotas · 2021-04-25T19:18:40Z

src/coreclr/tools/aot/ILCompiler.ReadyToRun/JitInterface/CorInfoImpl.ReadyToRun.cs

@@ -2257,7 +2249,7 @@ private bool NeedsTypeLayoutCheck(TypeDesc type)

        private bool HasLayoutMetadata(TypeDesc type)


The one place where this method is used looks like dead/unreachable code.

This method can only return true when type.IsValueType is true, but this case is handled in EncodeFieldBaseOffset earlier via the else if (pMT.IsValueType) check. It means that this method won't ever return true.

Nice cleanup, thanks Jan for pointing that out; deleted in the 9th commit.

jkotas · 2021-04-25T20:04:43Z

src/coreclr/tools/aot/ILCompiler.ReadyToRun/JitInterface/CorInfoImpl.ReadyToRun.cs

-            {
-                PreventRecursiveFieldInlinesOutsideVersionBubble(field, callerMethod);
-
-                // We won't try to be smart for classes with layout.


Actually, the old crossgen has a special case for this with this comment that kicks in for non-value types (different from what was here before that kicked in for value types).

Do we want to match it? Or are we confident that we have extensive test coverage for all non-value types with layout corner cases?

Also, once we add this back, can at least some of the remaining changes be simplified?

According to my understanding of the CoreCLR code, the layout check is based on the following condition:

runtime/src/coreclr/vm/methodtablebuilder.cpp

Line 11924 in 207b03a

BOOL fHasLayout =

It seems to me that this means that the conditional statement you originally pointed out should basically stay; the problematic bit is the method HasLayoutMetadata itself. We're certainly trying to match the CoreCLR runtime behavior; in this particular case we probably just missed the fact that the CoreRT version of HasLayoutMetadata is quite different from its CoreCLR runtime counterpart:

runtime/src/coreclr/vm/methodtablebuilder.cpp

Line 11746 in 207b03a

BOOL HasLayoutMetadata(Assembly* pAssembly, IMDInternalImport* pInternalImport, mdTypeDef cl, MethodTable* pParentMT, BYTE* pPackingSize, BYTE* pNLTType, BOOL* pfExplicitOffsets)

I'll look into putting HasLayoutMetadata back and modifying it based on the CoreCLR runtime version. Other than that, can you please be more specific regarding the simplifications you're suggesting? While I do agree that some of the runtime logic is complicated and somewhat arbitrary, I'm worried that unsettling it at this point might cause more harm than good.

I think the equivalent of the HasLayoutMetadata check should be just type.IsSequentialLayout || type.IsExplicitLayout.
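A minimal C++ sketch of that simplified check, assuming the layout bits are read directly from the TypeDef flags as defined by ECMA-335 (the same values as the tdLayoutMask / tdSequentialLayout / tdExplicitLayout constants in corhdr.h). The name HasLayoutMetadataSimplified is hypothetical; the runtime's HasLayoutMetadata quoted above also reports packing size and other details that this sketch ignores.

```cpp
#include <cstdint>

// Layout bits of the TypeDef flags (ECMA-335 TypeAttributes; corhdr.h CorTypeAttr).
constexpr uint32_t kLayoutMask       = 0x00000018; // tdLayoutMask
constexpr uint32_t kAutoLayout       = 0x00000000; // tdAutoLayout
constexpr uint32_t kSequentialLayout = 0x00000008; // tdSequentialLayout
constexpr uint32_t kExplicitLayout   = 0x00000010; // tdExplicitLayout

// Hypothetical helper: a type "has layout" iff its metadata declares
// sequential or explicit layout; auto layout (kAutoLayout) does not count.
constexpr bool HasLayoutMetadataSimplified(uint32_t typeDefFlags)
{
    return (typeDefFlags & kLayoutMask) == kSequentialLayout
        || (typeDefFlags & kLayoutMask) == kExplicitLayout;
}
```

For example, flags 0x109 (public | sequential | sealed) report layout, while 0x1 (public, auto layout) does not.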

Yes, the runtime logic is complicated. We have done multiple rounds of trying to replicate it, and we are still finding mismatches. It is a safe bet that this is not the last round of fixes. I think we need to make it simpler to get confident that it matches. I will do some ad-hoc testing to try to find more situations where it does not match.

Thanks Jan for your additional feedback. I have updated the runtime check based on your suggestion. Once the change passes basic testing, I'll also trigger another round of Pri1 testing that previously uncovered most of these inconsistencies. It would be certainly awesome if you were willing to use your vast expertise in this area to devise additional tests exercising various obscure corner cases we may have previously missed.

jkotas · 2021-04-26T23:29:41Z

src/coreclr/tools/Common/TypeSystem/Common/MetadataFieldLayoutAlgorithm.cs

@@ -857,7 +915,9 @@ private static SizeAndAlignment ComputeInstanceSize(MetadataType type, LayoutInt
            {
                if (type.IsValueType)
                {
-                    instanceSize = LayoutInt.AlignUp(instanceSize, alignment, target);
+                    instanceSize = LayoutInt.AlignUp(instanceSize,
+                        alignUpInstanceByteSize ? alignment : LayoutInt.Min(alignment, target.LayoutPointerSize),


I assume that the alignUpInstanceByteSize = false case will kick in for a struct like this on ARM32:

[StructLayout(LayoutKind.Explicit)] internal struct S3 { [FieldOffset(0)] public ulong tmp1; [FieldOffset(8)] public Object tmp2; }

The algorithm will compute the instanceSize of this struct as 12. Is that correct?

It looks like a bug in the type loader. The instance size of this struct on ARM should be 16, so that the long field is aligned at 8 bytes, so that potential atomic 64-bit operations work fine on it. It would be better to fix the type loader instead of replicating the bug here.
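To make the arithmetic concrete, here is a small C++ model of the value-type branch from the diff above, applied to S3 on ARM32 (pointer size 4). The raw size is 12 bytes (tmp1 occupies bytes 0..7, the 4-byte object reference bytes 8..11) and the largest field alignment is 8. The function names are hypothetical and the model ignores everything else ComputeInstanceSize does.

```cpp
#include <cstdint>

// Round value up to a multiple of alignment (alignment must be a power of two).
constexpr uint32_t AlignUp(uint32_t value, uint32_t alignment)
{
    return (value + alignment - 1) & ~(alignment - 1);
}

constexpr uint32_t Min(uint32_t a, uint32_t b) { return a < b ? a : b; }

// Hypothetical model of the expression in the diff:
//   AlignUp(instanceSize, alignUpInstanceByteSize ? alignment : Min(alignment, pointerSize))
constexpr uint32_t ValueTypeInstanceSize(uint32_t rawSize, uint32_t alignment,
                                         uint32_t pointerSize, bool alignUpInstanceByteSize)
{
    return AlignUp(rawSize, alignUpInstanceByteSize ? alignment : Min(alignment, pointerSize));
}
```

With alignUpInstanceByteSize = false the result is 12, so the second element of an S3[] would start at offset 12 and its tmp1 field would only be 4-byte aligned; rounding up to 16 keeps every element's ulong on an 8-byte boundary, which is what the comment above asks for.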

jkotas · 2021-04-26T23:42:26Z

src/coreclr/tools/aot/ILCompiler.ReadyToRun/JitInterface/CorInfoImpl.ReadyToRun.cs

-            }
-
-            return false;
+            return MetadataFieldLayoutAlgorithm.IsBlittableOrManagedSequentialType(type) || (type is MetadataType mt && mt.IsExplicitLayout);


I would simplify this to mt.IsExplicitLayout || mt.IsSequentialLayout. Maybe even inline it at the callsite.

jkotas · 2021-04-27T00:07:27Z

src/coreclr/tools/aot/ILCompiler.ReadyToRun/Compiler/ReadyToRunMetadataFieldLayoutAlgorithm.cs

@@ -803,7 +803,7 @@ protected override ComputedInstanceFieldLayout ComputeInstanceFieldLayout(Metada
                return ComputeExplicitFieldLayout(type, numInstanceFields);
            }
            else
-            if (type.IsEnum || MarshalUtils.IsBlittableType(type) || IsManagedSequentialType(type))
+            if (type.IsEnum || IsBlittableOrManagedSequentialType(type))


To get rid of dependency of the type layout algorithm on blittability here, I think we would simplify this into:

if (type.IsSequential && !type.ContainsPointers)

(with a matching change in the runtime). It would be an R2R-breaking change, but it would set us up for the future.

The dependency of the layout algorithm on blittability makes changes like #103 R2R breaking changes. We did not treat #103 as a R2R breaking change since the connection between interop, type layout and R2R was not understood.
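A minimal sketch of the layout-kind selection such a simplification would leave, assuming (as the later commits in this PR describe with MayContainGCPointers) that sequential layout is honored only for types without GC pointers. TypeShape and ChooseLayout are hypothetical names for illustration, not Crossgen2 or runtime APIs; the point is that blittability no longer appears in the decision.

```cpp
// Which field layout algorithm a type ends up using.
enum class LayoutKind { Auto, Sequential, Explicit };

// Hypothetical, simplified description of a type for this sketch only.
struct TypeShape
{
    bool isEnum;
    bool isExplicitLayout;
    bool isSequentialLayout;
    bool containsGCPointers;
};

// Explicit layout wins; enums and sequential types without GC pointers use
// sequential layout; everything else falls back to auto layout.
constexpr LayoutKind ChooseLayout(const TypeShape& t)
{
    return t.isExplicitLayout ? LayoutKind::Explicit
         : (t.isEnum || (t.isSequentialLayout && !t.containsGCPointers)) ? LayoutKind::Sequential
         : LayoutKind::Auto;
}
```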

davidwrighton · 2021-05-13T20:56:57Z

src/coreclr/tools/Common/TypeSystem/Common/MetadataFieldLayoutAlgorithm.cs

@@ -304,17 +304,17 @@ protected virtual void FinalizeRuntimeSpecificStaticFieldLayout(TypeSystemContex
        {
        }

-        protected static ComputedInstanceFieldLayout ComputeExplicitFieldLayout(MetadataType type, int numInstanceFields)
+        protected ComputedInstanceFieldLayout ComputeExplicitFieldLayout(MetadataType type, int numInstanceFields)


Is there a reason why you changed this from static to non-static?

At one point it was needed because there used to be a helper method calculating the offset bias. I have double-checked that it's no longer necessary after all the refactorings; will remove, thanks for pointing that out!

jkotas · 2021-05-13T21:10:15Z

src/coreclr/tools/aot/ILCompiler.ReadyToRun/JitInterface/CorInfoImpl.ReadyToRun.cs

-            }
-
-            return false;
+            return type is MetadataType metadataType && (metadataType.IsSequentialLayout || metadataType.IsExplicitLayout);


Can we also delete/replace IsBlittableType from ComputeInstanceFieldLayout so that the layout does not depend on the algorithm to compute blittability?


With this change, the only remaining pipelines using Crossgen1 are "r2r.yml", "r2r-extra.yml" and "release-tests.yml". I haven't yet identified the pipeline running the "release-tests.yml" script; for the "r2r*.yml", these now remain the only pipelines exercising Crossgen1. I don't think it makes sense to switch them over to CG2 as we already have their CG2 counterparts; my expectation is that, once CG1 is finally decommissioned, they will be just deleted. Thanks Tomas


I have basically reverted the marshalling code shuffles as they are no longer necessary based on the direction suggested by JanK. I haven't made the counterpart runtime changes yet, I'm running an initial smoke test to see all the places that blow up. Thanks Tomas

Based on Jan's suggestion I have changed the runtime behavior to align explicit layout structs by default. The funny part here was that the implementation is two-tiered, split into the case with vs. without layout metadata, using two completely different code paths. I have extracted a small part of the EEClassLayoutInfo calculation dealing with primitive field size and alignment into code usable by both codepaths to reduce code duplication. I have also bumped up the R2R compatibility version information as this is (intentionally) a globally breaking change for R2R images produced by older versions. As part of simplifying the various conditions I have deleted a section that claims we shouldn't be updating the number of instance bytes for blittable / managed sequential types. I just hope this is no longer needed so that I don't have to redo the marshalling helper shuffles again. Thanks Tomas

Per JanK's feedback I simplified CoreCLR runtime calculation of explicit layout size but I missed that Crossgen2 actually differs from it in a very subtle manner. This change fixes it by only accepting the explicit size if it's larger than or equal to the unaligned calculated explicit layout size. Thanks Tomas
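The acceptance rule described in this commit can be sketched as follows. Here explicitSize stands for the size declared in metadata (for example via StructLayoutAttribute.Size) and computedUnalignedSize for the end of the last explicitly placed field before any alignment; both names and that interpretation are assumptions, and how the result is aligned afterwards is deliberately left out.

```cpp
#include <cstdint>

// Hypothetical sketch: accept the declared size only when it is at least as
// large as the unaligned size computed from the field offsets; otherwise fall
// back to the computed size. Any subsequent alignment step is not modeled.
constexpr uint32_t ChooseExplicitInstanceSize(uint32_t explicitSize,
                                              uint32_t computedUnalignedSize)
{
    return explicitSize >= computedUnalignedSize ? explicitSize : computedUnalignedSize;
}
```

For example, a struct whose last field ends at byte 12 keeps a declared size of 16, but a declared size of 8 is ignored in favor of 12.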


I have simplified runtime and Crossgen2 behavior in the presence of sequential layout by removing blittability from the picture as the IsBlittable check is very complex and hard to maintain in sync between the native runtime and the managed compiler. I have introduced a somewhat weaker and much simpler check MayContainGCPointers that returns FALSE for all IsBlittable types (and for a few others). Thanks Tomas
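A rough sketch of why such a check is simpler: it only asks whether any instance field, directly or through a nested value type, is a GC-tracked reference, and ignores marshalling concerns such as how bool or char are represented natively. A struct with just a bool and a char field is not blittable, yet this check returns false for it, which would be an example of the "few others". FieldKind, TypeModel and this MayContainGCPointers function are a hypothetical model, not the runtime's implementation.

```cpp
#include <cassert>
#include <vector>

// Hypothetical field/type model for this sketch only.
enum class FieldKind { Primitive, UnmanagedPointer, ObjectReference, ByRef, ValueType };

struct TypeModel;

struct FieldModel
{
    FieldKind kind;
    const TypeModel* valueType; // non-null only when kind == FieldKind::ValueType
};

struct TypeModel
{
    std::vector<FieldModel> instanceFields;
};

// Returns true if any instance field (including fields of nested value types)
// is a GC-tracked reference. Primitives such as bool/char and unmanaged
// pointers never count, regardless of how they would be marshalled.
bool MayContainGCPointers(const TypeModel& type)
{
    for (const FieldModel& field : type.instanceFields)
    {
        switch (field.kind)
        {
        case FieldKind::ObjectReference:
        case FieldKind::ByRef:
            return true;
        case FieldKind::ValueType:
            assert(field.valueType != nullptr);
            if (MayContainGCPointers(*field.valueType))
                return true;
            break;
        case FieldKind::Primitive:
        case FieldKind::UnmanagedPointer:
            break;
        }
    }
    return false;
}
```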


In my investigation of the x86 failures I hit an interesting corner case: the class GCMemoryInfoData, which is checked for having a managed layout matching its native counterpart. Interestingly enough, that only worked because the class was internally reported as non-blittable and not managed sequential, so it ended up using the auto layout; when my change switched it over to the sequential layout, it turned out that the sequential layout algorithm differs from its native counterpart. While the change technically breaks this corner scenario, I suspect that the primary purpose of sequential layout is interop with native code, and I find the fact that it's actually mismatched concerning, though fixing it is beyond the scope of my pending change. Thanks Tomas

I originally thought the modification might assist in debugging layout issues such as the ones I debugged during the investigation. It turns out, however, that basically the same class name check can be put in EEClassLayoutInfo::CollectLayoutFieldMetadataThrowing, so I am removing this as superfluous temporary instrumentation. Thanks Tomas

trylek · 2021-05-27T17:52:26Z

@jkotas - I believe I have managed to make some progress on the change based on your suggestions, but in agreement with @davidwrighton I think it's even riskier to take than the original attempt using our existing blittability / managed sequential checks. As you can easily imagine, simplifying the condition for using sequential vs. auto layout silently switches some types over from one to the other, and the bug tail is enormous; I've been digging through it over the last two weeks and I'm nowhere near a green state. Two particular simple examples:

GCMemoryInfoData - this managed class has a native runtime counterpart and we use the DEFINE_CLASS / DEFINE_FIELD macros to check their consistency. Without the change the class uses auto layout, as it's disqualified from managed sequential due to being a reference type; interestingly enough, once my change switched it over to the sequential layout, it fell out of sync on x86 because the sequential layout algorithm is a worse match for the native layout than the auto layout. I have a fix for this particular issue, but the reality is that the fix technically changes the semantics of sequential layout and potentially affects blittable types and interop in general.
LAHashDependentHashTracker - this managed class has a native runtime counterpart just like GCMemoryInfoData but without my change it's considered blittable i.e. eligible for using sequential layout. When I attempted to limit changes to pre-existing behavior by only using sequential layout for value types without GC pointers, it switched this class to auto layout which ended up reordering its two fields and falling out of sync with the runtime.
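The reordering in both examples can be illustrated with a toy model: sequential layout keeps declaration order, while this simplified stand-in for auto layout places larger fields first. The sort-by-size rule is an assumption made for illustration only (CoreCLR's real auto layout has additional rules, for example around GC references); Field, SequentialOffsets and AutoOffsets are hypothetical names, and each field's alignment is assumed to equal its size.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Field { const char* name; uint32_t size; }; // alignment assumed == size

static uint32_t AlignUp(uint32_t value, uint32_t alignment)
{
    assert(alignment != 0);
    return (value + alignment - 1) / alignment * alignment;
}

// Sequential: keep declaration order, aligning each field naturally.
std::vector<uint32_t> SequentialOffsets(const std::vector<Field>& fields)
{
    std::vector<uint32_t> offsets;
    uint32_t pos = 0;
    for (const Field& f : fields)
    {
        pos = AlignUp(pos, f.size);
        offsets.push_back(pos);
        pos += f.size;
    }
    return offsets;
}

// Simplified "auto" model: place larger fields first (stable for equal sizes).
// Returned offsets are indexed by declaration order.
std::vector<uint32_t> AutoOffsets(const std::vector<Field>& fields)
{
    std::vector<size_t> order(fields.size());
    for (size_t i = 0; i < order.size(); ++i)
        order[i] = i;
    std::stable_sort(order.begin(), order.end(),
                     [&](size_t a, size_t b) { return fields[a].size > fields[b].size; });

    std::vector<uint32_t> offsets(fields.size());
    uint32_t pos = 0;
    for (size_t idx : order)
    {
        pos = AlignUp(pos, fields[idx].size);
        offsets[idx] = pos;
        pos += fields[idx].size;
    }
    return offsets;
}
```

For a 2-byte field declared before a 4-byte one, SequentialOffsets yields {0, 4} while AutoOffsets yields {4, 0}: the two fields swap places, which is the kind of silent change that makes a managed class fall out of sync with a native mirror checked by DEFINE_FIELD.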

My current plan is to mark this change as NO-MERGE for now and revive its previous version after incorporating Michal's feedback in form of a new PR; I can return to this cleanup after we fork off the final .NET 6 shipping bits - the 1-2 month period of inability to make more substantial runtime changes should be ideal for repeated testing and polishing aspects of this change. In many cases it will be useful / needed to consult official native compiler documentation to make sure we're doing the right / expected thing.

This change needs substantially more work.

trylek · 2021-06-03T16:08:00Z

Closing for now; I'll revive the PR after we fork off for .NET 6 and I'll start the new CI testing rounds to iron out the various problems.

trylek added the area-crossgen2-coreclr label Apr 16, 2021

trylek added this to the 6.0.0 milestone Apr 16, 2021

trylek requested review from davidwrighton and sandreenko April 16, 2021 22:07

trylek force-pushed the 46239 branch from 76ed0b5 to 0a445cb April 16, 2021 22:39

davidwrighton reviewed Apr 16, 2021


davidwrighton requested changes Apr 16, 2021


trylek force-pushed the 46239 branch from 0a445cb to 7fadd2b April 18, 2021 20:10

This was referenced Apr 18, 2021

Switch over the runtime outerloop pipeline to use Crossgen2 #51444

Closed

Test failure Regressions\\coreclr\\GitHub_49826\\test49826\\test49826.cmd #51542

Closed

trylek force-pushed the 46239 branch from 7fadd2b to 222347c April 22, 2021 17:41

davidwrighton previously approved these changes Apr 23, 2021


trylek force-pushed the 46239 branch from 7d60346 to b7411a4 April 23, 2021 23:17

jkotas reviewed Apr 25, 2021


jkotas reviewed Apr 26, 2021


jkotas reviewed Apr 27, 2021


trylek force-pushed the 46239 branch from e7a8d73 to 3078a93 April 29, 2021 16:21

trylek force-pushed the 46239 branch from 3078a93 to b505be2 May 7, 2021 16:58

runfoapp bot mentioned this pull request May 11, 2021

CI is currently broken "You cannot extract a file outside of the target path" #52596

Closed

trylek force-pushed the 46239 branch from cc30e5f to 3f1ee17 May 13, 2021 18:26

davidwrighton reviewed May 13, 2021


jkotas reviewed May 13, 2021


trylek mentioned this pull request May 17, 2021

SingleFile diagnostic support - Add export table and DotNetRuntimeInfo to dumps #52731

Merged

runfoapp bot mentioned this pull request May 20, 2021

InvokeCodeThatShouldFirEvents_EnsureEventsFired fails on OSX #52710

Closed

trylek force-pushed the 46239 branch 2 times, most recently from bb2921f to b8d0ed4 May 22, 2021 19:43

trylek added 15 commits May 24, 2021 21:05

One more fix for explicit layout with zero fields

54389f7

Remove no longer needed R2RMFLA override of IsBlittableOrManagedSequential

9e4b351

One more fix for the explicit layout tests

783b02f

Put back static modifier on ComputeExplicitFieldLayout per David's PR feedback

a2fff2b

Make changes internally consistent and address DavidWr's offline feedback

46b95f0

Additional consistency fixes for x86

11cc7df

Fix pre-existing solution consistency mismatch in Crossgen2.sln

c9297dd

trylek force-pushed the 46239 branch from fd7dc8b to d4da238 May 24, 2021 19:09

Fix architecture-specific tests to cater for x86 Windows vs. Linux

3df7a59

trylek mentioned this pull request May 28, 2021

46239 v2 (no runtime layout changes) #53424

Merged

davidwrighton modified the milestones: 6.0.0, Future May 28, 2021

This was referenced May 28, 2021

leakwheel failing in CI #53452

Closed

Test failure Loader/classloader/StaticVirtualMethods/GenericContext/GenericContextTest/GenericContextTest.sh #53161

Closed

trylek closed this Jun 3, 2021

ghost locked as resolved and limited conversation to collaborators Jul 3, 2021
