-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix HFA detection in Crossgen2 #80218
Conversation
According to customer feedback some WPF apps are crashing on arm64 at runtime in debug mode when compiled with Crossgen2. Based on the initial investigation by Anton Lapounov and with help from Jan Vorlicek I have managed to identify that the problem is caused by a mismatch between the native CoreCLR runtime and Crossgen2 w.r.t. identification of HFA types. This change puts Crossgen2 behavior in sync with the CoreCLR runtime. I have verified locally that this makes the GC ref map for the method System.Windows.Media.PathGeometry.GetPathBoundsAsRB identical with the runtime version and avoids the assertion failure that was previously triggered in debug CoreCLR builds due to this mismatch. Thanks Tomas
// eligible for HA, but it is hard to tell the real intent. Make it simple and just | ||
// unconditionally disable HAs for explicit layout. | ||
if (metadataType.IsExplicitLayout) | ||
return NotHA; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will a structure like below be considered an HFA now?
[StructLayout(LayoutKind.Explicit)]
struct Foo
{
[FieldOffset(0)] double f1;
[FieldOffset(5)] double f2;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Anton, you have raised a very interesting point. In the particular case you call out it won't because I think the total size will be 13 and according to line 1032 it will fail checking that the size is divisible by the floating point size, however in the loop over fields the individual fields aren't checked for the presence of proper alignment. Interestingly enough, in the original native runtime version I used for the update,
runtime/src/coreclr/vm/class.cpp
Line 1595 in 1928cd2
for (UINT i = 0; i < GetNumInstanceFields(); i++) |
the check is present for R4 and R8 types but not for the general valuetype case so that, if I'm not mistaken, replacing the type of your f2
with a struct
containing a single double
field would be erroneously considered HFA by the runtime. As a first step, I'll update this PR by adding the modulus check to the loop, we can then discuss whether the native runtime also needs fixing or if I'm just misreading the code.
Do you think it might be helpful to add a comment requesting to keep this code in sync with particular runtime functions? |
For sure, good idea, I'll add that as well. |
The runtime-coreclr crossgen2 outerloop failures are unrelated, something's broken around the R2RDump validation, /cc @cshung in case you might see off the top of your head what's happening, otherwise I'll investigate (the Crossgen2 outerloop runs require attention anyway as they aren't in good shape right now). @janvorli / @AntonLapounov / @MichalStrehovsky, can one of you please take another look and let me know what remains to be fixed before this can be merged in? I have an e-mail thread with the internal customer who hit this and I promised some progress this week. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good regarding syncing with JIT. I think we need to follow up on the issue you found.
It looks like there are two main types of failures: This is potentially a command-line construction issue, I am not sure why only a few referenced assemblies failed to resolve. I don't recall I spent time thinking about that, so might be just a missed command-line switch in the scripts.
And here is possibly a real bug, either in crossgen2 or r2rdump, this should be investigated at a higher priority.
|
Thanks Anton and Andrew for your feedback. I have created a follow-up issue for the missing alignment check: |
/backport to release/6.0 |
Started backporting to release/6.0: https://github.com/dotnet/runtime/actions/runs/3961663733 |
/backport to release/7.0 |
Started backporting to release/7.0: https://github.com/dotnet/runtime/actions/runs/3961666306 |
@trylek backporting to release/6.0 failed, the patch most likely resulted in conflicts: $ git am --3way --ignore-whitespace --keep-non-patch changes.patch
Applying: Fix HFA detection in Crossgen2
Applying: Add modulus checks for fields and runtime reference comment per Anton's PR feedback
error: sha1 information is lacking or useless (src/coreclr/tools/Common/TypeSystem/Common/MetadataFieldLayoutAlgorithm.cs).
error: could not build fake ancestor
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0002 Add modulus checks for fields and runtime reference comment per Anton's PR feedback
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".
Error: The process '/usr/bin/git' failed with exit code 128 Please backport manually! |
@trylek an error occurred while backporting to release/6.0, please check the run log for details! Error: git am failed, most likely due to a merge conflict. |
@trylek backporting to release/7.0 failed, the patch most likely resulted in conflicts: $ git am --3way --ignore-whitespace --keep-non-patch changes.patch
Applying: Fix HFA detection in Crossgen2
Applying: Add modulus checks for fields and runtime reference comment per Anton's PR feedback
error: sha1 information is lacking or useless (src/coreclr/tools/Common/TypeSystem/Common/MetadataFieldLayoutAlgorithm.cs).
error: could not build fake ancestor
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0002 Add modulus checks for fields and runtime reference comment per Anton's PR feedback
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".
Error: The process '/usr/bin/git' failed with exit code 128 Please backport manually! |
@trylek an error occurred while backporting to release/7.0, please check the run log for details! Error: git am failed, most likely due to a merge conflict. |
Description: This change puts HFA calculation in Crossgen2 in sync with native CoreCLR runtime for value types with explicit layout. Previously Crossgen2 had a shortcut in the routine deciding that structs with explicit layouts are never marked as HFA that disagreed with the CoreCLR runtime; consequently, on arm64, Crossgen2 disagreed with the runtime on whether or not a function returning such a type should allocate the stack slot for return value, basically messing up the calling convention and GC refmap, resulting in various random AVs and corruptions. This was first observed by an internal customer in WPF apps where MilRectD is the type in question, later JanK filed the issue 79327 for the same problem. Customer impact: Random runtime crashes on arm64. Regression: Nope, I believe the incomplete implementation was the original one, this change just "improves it" by putting it in better sync with the native runtime. I have also added a code comment mentioning that these two need to be kept in sync. Risk: Low - the error in the previous implementation is obvious, R2RDump and my new runtime diagnostics clearly show the GC refmap mismatch caused by this problem and its fixing after applying the Crossgen2 fix. Link to issue: dotnet#79327 Link to PR against main: dotnet#80218 Publishing impact: In the particular case of the WPF app the problem was in the PresentationCore.dll assembly. The assembly (or rather the entire WPF) need to be recompiled with Crossgen2 with the fix applied for this to take effect. For now I assume that is an automated part of the servicing process. Thanks Tomas
Description: This change puts HFA calculation in Crossgen2 in sync with native CoreCLR runtime for value types with explicit layout. Previously Crossgen2 had a shortcut in the routine deciding that structs with explicit layouts are never marked as HFA that disagreed with the CoreCLR runtime; consequently, on arm64, Crossgen2 disagreed with the runtime on whether or not a function returning such a type should allocate the stack slot for return value, basically messing up the calling convention and GC refmap, resulting in various random AVs and corruptions. This was first observed by an internal customer in WPF apps where MilRectD is the type in question, later JanK filed the issue 79327 for the same problem. Customer impact: Random runtime crashes on arm64. Regression: Nope, I believe the incomplete implementation was the original one, this change just "improves it" by putting it in better sync with the native runtime. I have also added a code comment mentioning that these two need to be kept in sync. Risk: Low - the error in the previous implementation is obvious, R2RDump and my new runtime diagnostics clearly show the GC refmap mismatch caused by this problem and its fixing after applying the Crossgen2 fix. Link to issue: Link to PR against main: Publishing impact: In the particular case of the WPF app the problem was in the PresentationCore.dll assembly. The assembly (or rather the entire WPF) need to be recompiled with Crossgen2 with the fix applied for this to take effect. For now I assume that is an automated part of the servicing process. Thanks Tomas
Description: This change puts HFA calculation in Crossgen2 in sync with native CoreCLR runtime for value types with explicit layout. Previously Crossgen2 had a shortcut in the routine deciding that structs with explicit layouts are never marked as HFA that disagreed with the CoreCLR runtime; consequently, on arm64, Crossgen2 disagreed with the runtime on whether or not a function returning such a type should allocate the stack slot for return value, basically messing up the calling convention and GC refmap, resulting in various random AVs and corruptions. This was first observed by an internal customer in WPF apps where MilRectD is the type in question, later JanK filed the issue 79327 for the same problem. Customer impact: Random runtime crashes on arm64. Regression: Nope, I believe the incomplete implementation was the original one, this change just "improves it" by putting it in better sync with the native runtime. I have also added a code comment mentioning that these two need to be kept in sync. Risk: Low - the error in the previous implementation is obvious, R2RDump and my new runtime diagnostics clearly show the GC refmap mismatch caused by this problem and its fixing after applying the Crossgen2 fix. Link to issue: #79327 Link to PR against main: #80218 Publishing impact: In the particular case of the WPF app the problem was in the PresentationCore.dll assembly. The assembly (or rather the entire WPF) need to be recompiled with Crossgen2 with the fix applied for this to take effect. For now I assume that is an automated part of the servicing process. Thanks Tomas
Description: This change puts HFA calculation in Crossgen2 in sync with native CoreCLR runtime for value types with explicit layout. Previously Crossgen2 had a shortcut in the routine deciding that structs with explicit layouts are never marked as HFA that disagreed with the CoreCLR runtime; consequently, on arm64, Crossgen2 disagreed with the runtime on whether or not a function returning such a type should allocate the stack slot for return value, basically messing up the calling convention and GC refmap, resulting in various random AVs and corruptions. This was first observed by an internal customer in WPF apps where MilRectD is the type in question, later JanK filed the issue 79327 for the same problem. Customer impact: Random runtime crashes on arm64. Regression: Nope, I believe the incomplete implementation was the original one, this change just "improves it" by putting it in better sync with the native runtime. I have also added a code comment mentioning that these two need to be kept in sync. Risk: Low - the error in the previous implementation is obvious, R2RDump and my new runtime diagnostics clearly show the GC refmap mismatch caused by this problem and its fixing after applying the Crossgen2 fix. Link to issue: Link to PR against main: Publishing impact: In the particular case of the WPF app the problem was in the PresentationCore.dll assembly. The assembly (or rather the entire WPF) need to be recompiled with Crossgen2 with the fix applied for this to take effect. For now I assume that is an automated part of the servicing process. Thanks Tomas
According to customer feedback some WPF apps are crashing on arm64 at runtime in debug mode when compiled with Crossgen2. Based on the initial investigation by Anton Lapounov and with help from Jan Vorlicek I have managed to identify that the problem is caused by a mismatch between the native CoreCLR runtime and Crossgen2 w.r.t. identification of HFA types.
This change puts Crossgen2 behavior in sync with the CoreCLR runtime. I have verified locally that this makes the GC ref map for the method System.Windows.Media.PathGeometry.GetPathBoundsAsRB identical with the runtime version and avoids the assertion failure that was previously triggered in debug CoreCLR builds due to this mismatch.
Thanks
Tomas
/cc @dotnet/crossgen-contrib