Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix HFA detection in Crossgen2 #80218

Merged
merged 2 commits into from
Jan 9, 2023
Merged

Fix HFA detection in Crossgen2 #80218

merged 2 commits into from
Jan 9, 2023

Conversation

trylek
Copy link
Member

@trylek trylek commented Jan 5, 2023

According to customer feedback some WPF apps are crashing on arm64 at runtime in debug mode when compiled with Crossgen2. Based on the initial investigation by Anton Lapounov and with help from Jan Vorlicek I have managed to identify that the problem is caused by a mismatch between the native CoreCLR runtime and Crossgen2 w.r.t. identification of HFA types.

This change puts Crossgen2 behavior in sync with the CoreCLR runtime. I have verified locally that this makes the GC ref map for the method System.Windows.Media.PathGeometry.GetPathBoundsAsRB identical with the runtime version and avoids the assertion failure that was previously triggered in debug CoreCLR builds due to this mismatch.

Thanks

Tomas

/cc @dotnet/crossgen-contrib

According to customer feedback some WPF apps are crashing on arm64
at runtime in debug mode when compiled with Crossgen2. Based on
the initial investigation by Anton Lapounov and with help from
Jan Vorlicek I have managed to identify that the problem is caused
by a mismatch between the native CoreCLR runtime and Crossgen2
w.r.t. identification of HFA types.

This change puts Crossgen2 behavior in sync with the CoreCLR
runtime. I have verified locally that this makes the GC ref map
for the method System.Windows.Media.PathGeometry.GetPathBoundsAsRB
identical with the runtime version and avoids the assertion failure
that was previously triggered in debug CoreCLR builds due to this
mismatch.

Thanks

Tomas
// eligible for HA, but it is hard to tell the real intent. Make it simple and just
// unconditionally disable HAs for explicit layout.
if (metadataType.IsExplicitLayout)
return NotHA;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will a structure like below be considered an HFA now?

[StructLayout(LayoutKind.Explicit)]
struct Foo
{
    [FieldOffset(0)] double f1;
    [FieldOffset(5)] double f2;
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Anton, you have raised a very interesting point. In the particular case you call out it won't because I think the total size will be 13 and according to line 1032 it will fail checking that the size is divisible by the floating point size, however in the loop over fields the individual fields aren't checked for the presence of proper alignment. Interestingly enough, in the original native runtime version I used for the update,

for (UINT i = 0; i < GetNumInstanceFields(); i++)

the check is present for R4 and R8 types but not for the general valuetype case so that, if I'm not mistaken, replacing the type of your f2 with a struct containing a single double field would be erroneously considered HFA by the runtime. As a first step, I'll update this PR by adding the modulus check to the loop, we can then discuss whether the native runtime also needs fixing or if I'm just misreading the code.

@AntonLapounov
Copy link
Member

This change puts Crossgen2 behavior in sync with the CoreCLR runtime.

Do you think it might be helpful to add a comment requesting to keep this code in sync with particular runtime functions?

@trylek
Copy link
Member Author

trylek commented Jan 5, 2023

Do you think it might be helpful to add a comment requesting to keep this code in sync with particular runtime functions?

For sure, good idea, I'll add that as well.

@trylek
Copy link
Member Author

trylek commented Jan 9, 2023

The runtime-coreclr crossgen2 outerloop failures are unrelated, something's broken around the R2RDump validation, /cc @cshung in case you might see off the top of your head what's happening, otherwise I'll investigate (the Crossgen2 outerloop runs require attention anyway as they aren't in good shape right now). @janvorli / @AntonLapounov / @MichalStrehovsky, can one of you please take another look and let me know what remains to be fixed before this can be merged in? I have an e-mail thread with the internal customer who hit this and I promised some progress this week. Thanks!

Copy link
Member

@AntonLapounov AntonLapounov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good regarding syncing with JIT. I think we need to follow up on the issue you found.

@cshung
Copy link
Member

cshung commented Jan 9, 2023

The runtime-coreclr crossgen2 outerloop failures are unrelated, something's broken around the R2RDump validation, /cc @cshung in case you might see off the top of your head what's happening, otherwise I'll investigate (the Crossgen2 outerloop runs require attention anyway as they aren't in good shape right now). @janvorli / @AntonLapounov / @MichalStrehovsky, can one of you please take another look and let me know what remains to be fixed before this can be merged in? I have an e-mail thread with the internal customer who hit this and I promised some progress this week. Thanks!

It looks like there are two main types of failures:

This is potentially a command-line construction issue, I am not sure why only a few referenced assemblies failed to resolve. I don't recall I spent time thinking about that, so might be just a missed command-line switch in the scripts.

Error: System.Exception: Missing reference assembly: <AssemblyName>

And here is possibly a real bug, either in crossgen2 or r2rdump, this should be investigated at a higher priority.

Error: System.BadImageFormatException: Failed to convert invalid RVA to offset: <BadOffset, often 6xxxx>

@trylek
Copy link
Member Author

trylek commented Jan 9, 2023

Thanks Anton and Andrew for your feedback. I have created a follow-up issue for the missing alignment check:

#80393

@trylek trylek merged commit 6f119cb into dotnet:main Jan 9, 2023
@trylek trylek deleted the Crossgen2HFAFix branch January 9, 2023 21:31
@trylek
Copy link
Member Author

trylek commented Jan 19, 2023

/backport to release/6.0

@github-actions
Copy link
Contributor

Started backporting to release/6.0: https://github.com/dotnet/runtime/actions/runs/3961663733

@trylek
Copy link
Member Author

trylek commented Jan 19, 2023

/backport to release/7.0

@github-actions
Copy link
Contributor

Started backporting to release/7.0: https://github.com/dotnet/runtime/actions/runs/3961666306

@github-actions
Copy link
Contributor

@trylek backporting to release/6.0 failed, the patch most likely resulted in conflicts:

$ git am --3way --ignore-whitespace --keep-non-patch changes.patch

Applying: Fix HFA detection in Crossgen2
Applying: Add modulus checks for fields and runtime reference comment per Anton's PR feedback
error: sha1 information is lacking or useless (src/coreclr/tools/Common/TypeSystem/Common/MetadataFieldLayoutAlgorithm.cs).
error: could not build fake ancestor
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0002 Add modulus checks for fields and runtime reference comment per Anton's PR feedback
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".
Error: The process '/usr/bin/git' failed with exit code 128

Please backport manually!

@github-actions
Copy link
Contributor

@trylek an error occurred while backporting to release/6.0, please check the run log for details!

Error: git am failed, most likely due to a merge conflict.

@github-actions
Copy link
Contributor

@trylek backporting to release/7.0 failed, the patch most likely resulted in conflicts:

$ git am --3way --ignore-whitespace --keep-non-patch changes.patch

Applying: Fix HFA detection in Crossgen2
Applying: Add modulus checks for fields and runtime reference comment per Anton's PR feedback
error: sha1 information is lacking or useless (src/coreclr/tools/Common/TypeSystem/Common/MetadataFieldLayoutAlgorithm.cs).
error: could not build fake ancestor
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0002 Add modulus checks for fields and runtime reference comment per Anton's PR feedback
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".
Error: The process '/usr/bin/git' failed with exit code 128

Please backport manually!

@github-actions
Copy link
Contributor

@trylek an error occurred while backporting to release/7.0, please check the run log for details!

Error: git am failed, most likely due to a merge conflict.

trylek added a commit to trylek/runtime that referenced this pull request Jan 19, 2023
Description:

This change puts HFA calculation in Crossgen2 in sync with native
CoreCLR runtime for value types with explicit layout. Previously
Crossgen2 had a shortcut in the routine deciding that structs with
explicit layouts are never marked as HFA that disagreed with the
CoreCLR runtime; consequently, on arm64, Crossgen2 disagreed with
the runtime on whether or not a function returning such a type
should allocate the stack slot for return value, basically messing
up the calling convention and GC refmap, resulting in various
random AVs and corruptions. This was first observed by an internal
customer in WPF apps where MilRectD is the type in question, later
JanK filed the issue 79327 for the same problem.

Customer impact:

Random runtime crashes on arm64.

Regression:

Nope, I believe the incomplete implementation was the original one,
this change just "improves it" by putting it in better sync with
the native runtime. I have also added a code comment mentioning
that these two need to be kept in sync.

Risk:

Low - the error in the previous implementation is obvious, R2RDump
and my new runtime diagnostics clearly show the GC refmap mismatch
caused by this problem and its fixing after applying the Crossgen2
fix.

Link to issue:

dotnet#79327

Link to PR against main:

dotnet#80218

Publishing impact:

In the particular case of the WPF app the problem was in the
PresentationCore.dll assembly. The assembly (or rather the entire
WPF) need to be recompiled with Crossgen2 with the fix applied
for this to take effect. For now I assume that is an automated part
of the servicing process.

Thanks

Tomas
trylek added a commit to trylek/runtime that referenced this pull request Jan 19, 2023
Description:

This change puts HFA calculation in Crossgen2 in sync with native
CoreCLR runtime for value types with explicit layout. Previously
Crossgen2 had a shortcut in the routine deciding that structs with
explicit layouts are never marked as HFA that disagreed with
the CoreCLR runtime; consequently, on arm64, Crossgen2 disagreed
with the runtime on whether or not a function returning such a type
should allocate the stack slot for return value, basically
messing up the calling convention and GC refmap, resulting in various
random AVs and corruptions. This was first observed by an internal
customer in WPF apps where MilRectD is the type in question, later
JanK filed the issue 79327 for the same problem.

Customer impact:

Random runtime crashes on arm64.

Regression:

Nope, I believe the incomplete implementation was the original one,
this change just "improves it" by putting it in better sync with
the native runtime. I have also added a code comment mentioning
that these two need to be kept in sync.

Risk:

Low - the error in the previous implementation is obvious, R2RDump
and my new runtime diagnostics clearly show the GC refmap mismatch
caused by this problem and its fixing after applying the Crossgen2 fix.

Link to issue:

Link to PR against main:

Publishing impact:

In the particular case of the WPF app the problem was in
the PresentationCore.dll assembly. The assembly (or rather the entire WPF)
need to be recompiled with Crossgen2 with the fix applied for this to
take effect. For now I assume that is an automated part of the servicing
process.

Thanks

Tomas
carlossanlop pushed a commit that referenced this pull request Feb 8, 2023
Description:

This change puts HFA calculation in Crossgen2 in sync with native
CoreCLR runtime for value types with explicit layout. Previously
Crossgen2 had a shortcut in the routine deciding that structs with
explicit layouts are never marked as HFA that disagreed with the
CoreCLR runtime; consequently, on arm64, Crossgen2 disagreed with
the runtime on whether or not a function returning such a type
should allocate the stack slot for return value, basically messing
up the calling convention and GC refmap, resulting in various
random AVs and corruptions. This was first observed by an internal
customer in WPF apps where MilRectD is the type in question, later
JanK filed the issue 79327 for the same problem.

Customer impact:

Random runtime crashes on arm64.

Regression:

Nope, I believe the incomplete implementation was the original one,
this change just "improves it" by putting it in better sync with
the native runtime. I have also added a code comment mentioning
that these two need to be kept in sync.

Risk:

Low - the error in the previous implementation is obvious, R2RDump
and my new runtime diagnostics clearly show the GC refmap mismatch
caused by this problem and its fixing after applying the Crossgen2
fix.

Link to issue:

#79327

Link to PR against main:

#80218

Publishing impact:

In the particular case of the WPF app the problem was in the
PresentationCore.dll assembly. The assembly (or rather the entire
WPF) need to be recompiled with Crossgen2 with the fix applied
for this to take effect. For now I assume that is an automated part
of the servicing process.

Thanks

Tomas
carlossanlop pushed a commit that referenced this pull request Feb 8, 2023
Description:

This change puts HFA calculation in Crossgen2 in sync with native
CoreCLR runtime for value types with explicit layout. Previously
Crossgen2 had a shortcut in the routine deciding that structs with
explicit layouts are never marked as HFA that disagreed with
the CoreCLR runtime; consequently, on arm64, Crossgen2 disagreed
with the runtime on whether or not a function returning such a type
should allocate the stack slot for return value, basically
messing up the calling convention and GC refmap, resulting in various
random AVs and corruptions. This was first observed by an internal
customer in WPF apps where MilRectD is the type in question, later
JanK filed the issue 79327 for the same problem.

Customer impact:

Random runtime crashes on arm64.

Regression:

Nope, I believe the incomplete implementation was the original one,
this change just "improves it" by putting it in better sync with
the native runtime. I have also added a code comment mentioning
that these two need to be kept in sync.

Risk:

Low - the error in the previous implementation is obvious, R2RDump
and my new runtime diagnostics clearly show the GC refmap mismatch
caused by this problem and its fixing after applying the Crossgen2 fix.

Link to issue:

Link to PR against main:

Publishing impact:

In the particular case of the WPF app the problem was in
the PresentationCore.dll assembly. The assembly (or rather the entire WPF)
need to be recompiled with Crossgen2 with the fix applied for this to
take effect. For now I assume that is an automated part of the servicing
process.

Thanks

Tomas
@ghost ghost locked as resolved and limited conversation to collaborators Feb 19, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants