Assert failure: !pFlat->HasReadyToRunHeader()
#68654
Comments
I got this on another PR last night. |
We need to set firm criteria, but in general we label failures in PR validation/CI with blocking-clean-ci if they have happened within the last two weeks or so and are not one-off cases. Feel free to use this label. |
Adding @trylek to take a look, since this is affecting PRs. |
Hmm, this is weird. This happens when the CoreCLR native runtime runs the crossgen2 app to compile System.Private.CoreLib, but at that point Crossgen2 itself shouldn't be crossgenned; that only happens "later" in the installer build. At first glance, I suspect this is related to @agocke's #67636 (the timing more or less matches, as the PR was merged 6 days ago). I also believe that this is the code area that @VSadov substantially refactored some time ago as part of his work on single exe publishing; maybe the invariants have shifted somehow w.r.t. the Native AOT CG2 build? |
Looks like this is only happening on Mac. NativeAOT isn't supported on Mac, so this should be R2R + single-file. |
This might be a real product issue. I think this is the only testing configuration where we don't run single file in Release, where I presume the assert doesn't exist. + @VSadov |
I think this is the same as #67062. I am investigating the issue, and it looks like we sometimes see PE sections overlapping in memory. This is either a loader bug or a crossgen bug, most likely crossgen. We used to silently take a fallback approach; now it will cause a failure (and an assert in Debug). |
The algorithms are the same for OSX and Unix in general, but we see these failures on OSX only. The main difference is that OSX has a larger OS page size, and we align to that when we map. I think we may not leave enough room between section RVAs for the potentially bigger alignment. That is my current theory. |
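To make the theory above concrete, here is a minimal sketch of how rounding section ranges to a larger page size can make neighbouring sections collide. It is not the runtime's loader code: the section names, RVAs, sizes, and both page sizes are made-up illustrative values, and `mapped_range` and `find_overlaps` are hypothetical helpers.

```python
def align_down(value, alignment):
    """Round value down to a multiple of alignment (a power of two)."""
    return value & ~(alignment - 1)


def align_up(value, alignment):
    """Round value up to a multiple of alignment (a power of two)."""
    return (value + alignment - 1) & ~(alignment - 1)


def mapped_range(rva, size, page_size):
    """Page-aligned [start, end) range needed to map one section."""
    return align_down(rva, page_size), align_up(rva + size, page_size)


def find_overlaps(sections, page_size):
    """Return pairs of adjacent sections whose page-aligned ranges overlap.

    sections is a list of (name, rva, size) tuples sorted by rva.
    """
    ranges = [(name, *mapped_range(rva, size, page_size))
              for name, rva, size in sections]
    overlaps = []
    for (name_a, _, end_a), (name_b, start_b, _) in zip(ranges, ranges[1:]):
        if start_b < end_a:
            overlaps.append((name_a, name_b))
    return overlaps


# Hypothetical layout: RVAs are spaced for a 0x1000 (4K) page.
sections = [(".text", 0x1000, 0x3800), (".data", 0x5000, 0x600)]

print(find_overlaps(sections, 0x1000))  # small page: ranges stay disjoint
print(find_overlaps(sections, 0x4000))  # larger page: both sections share a page
```

With the smaller page the two ranges end and start on the same boundary, so nothing overlaps. With the larger page both sections round into the same page, which is the kind of collision the theory describes.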
The failure indicates that crossgen (or some of its assemblies) is R2R already. If the bug is in crossgen, I wonder how soon this failure will go away after it is fixed in main. |
For the log,
I think it's possible that System.Private.CoreLib may already be crossgen'd here, but the rest of the framework assemblies might not be. IIRC we run crossgen against S.P.C in the normal build, but not against the libraries. |
I think there's a bit of subtlety worth elaborating on. Please note that the build step that fails is execution of Crossgen2 with the aim to compile System.Private.CoreLib; if we were trying to recompile an already crossgenned SPC, that would probably amount to a build script bug causing us to run the step twice or something. But I believe that's not what's happening for two reasons:
@VSadov, what is the difference between Linux and OSX page sizes for the individual architectures? The only conditional I originally put in was that 32bit = 4K, 64bit = 64K. If it's more involved, this may certainly need updating. Having said that, please note that the check for the presence of the R2R header also checks, among other things, the R2R version number. I have a hard time imagining how this could just randomly appear in the in-memory mapped executable space due to some section overlaps. |
Ah, yes, if this is the test execution of
Which notably runs against |
Yes, I believe that's exactly it according to the log:
Assert failure(PID 31107 [0x00007983], Thread: 154553 [0x25bb9]): !pFlat->HasReadyToRunHeader()
File: /Users/runner/work/1/s/src/coreclr/vm/peimagelayout.cpp Line: 87
Image: /Users/runner/work/1/s/artifacts/bin/coreclr/OSX.x64.Release/crossgen2/osx-x64/publish/crossgen2
/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/tmp5d8ef3b259b44fadafe924759643b8b1.exec.cmd: line 2: 31107 Abort trap: 6 /Users/runner/work/1/s/artifacts/bin/coreclr/OSX.x64.Release/crossgen2/osx-x64/publish/crossgen2 /Users/runner/work/1/s/artifacts/bin/coreclr/OSX.x64.Debug/IL/System.Private.CoreLib.dll --out /Users/runner/work/1/s/artifacts/obj/Microsoft.NETCore.App.Crossgen2/Release/net7.0/osx-x64/S.P.C.tmp
/Users/runner/work/1/s/src/installer/pkg/sfx/Microsoft.NETCore.App/Microsoft.NETCore.App.Crossgen2.sfxproj(70,5): error MSB3073: The command "/Users/runner/work/1/s/artifacts/bin/coreclr/OSX.x64.Release/crossgen2/osx-x64/publish/crossgen2 /Users/runner/work/1/s/artifacts/bin/coreclr/OSX.x64.Debug/IL/System.Private.CoreLib.dll --out /Users/runner/work/1/s/artifacts/obj/Microsoft.NETCore.App.Crossgen2/Release/net7.0/osx-x64/S.P.C.tmp" exited with code 134.
##[error]src/installer/pkg/sfx/Microsoft.NETCore.App/Microsoft.NETCore.App.Crossgen2.sfxproj(70,5): error MSB3073: (NETCORE_ENGINEERING_TELEMETRY=Build) The command "/Users/runner/work/1/s/artifacts/bin/coreclr/OSX.x64.Release/crossgen2/osx-x64/publish/crossgen2 /Users/runner/work/1/s/artifacts/bin/coreclr/OSX.x64.Debug/IL/System.Private.CoreLib.dll --out /Users/runner/work/1/s/artifacts/obj/Microsoft.NETCore.App.Crossgen2/Release/net7.0/osx-x64/S.P.C.tmp" exited with code 134. |
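As a side note on reading the log above: "Abort trap: 6" and "exited with code 134" describe the same event, since a POSIX shell reports a process killed by signal N as exit status 128 + N. A small sketch of that decoding follows; `signal_from_exit_code` is a hypothetical helper, not part of the build scripts, and it assumes a POSIX platform where SIGABRT is signal 6.

```python
import signal


def signal_from_exit_code(exit_code):
    """Return the signal implied by a shell exit status above 128, else None."""
    if exit_code <= 128:
        return None
    try:
        return signal.Signals(exit_code - 128)
    except ValueError:
        # Not a signal number known on this platform.
        return None


sig = signal_from_exit_code(134)
print(sig, int(sig))  # 134 - 128 = 6, the abort signal raised by the assert
```

So the MSB3073 error is only MSBuild relaying the abort triggered by the failed assert, not a separate failure.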
There's one other thing that I find super weird. According to runtime/src/coreclr/crossgen-corelib.proj Line 83 in 463be3a
we should be running crossgen2 via dotnet but in the above log it looks like we're directly executing the native app "crossgen2". I don't see any such code path capable of emitting the above command-line in my repo clone that's about 2 days old; where can it be coming from? |
Hmm, my bad, we're apparently running it through the sfxproj. I guess I'm now completely confused about where and how the SPC crossgenning takes place. Is the crossgen2-corelib script still used? |
Yup, there are two crossgen calls now. The first is at the very beginning of the CoreCLR build, which we need for all later runs with S.P.C. The second is during package construction, where we compile crossgen2 as either NativeAOT or single-file. That's a live build -- it gathers all the live binaries and compiles them then. It then compiles S.P.C one last time, as a basic validation that the build works. That last build is the one that's failing. |
Ahh, thanks Andy, I see it now in the log. So that's actually an additional interesting data point: the "first" CG2 execution for SPC compilation under "dotnet" passes just fine; it's only the second one, presumably using the single-file CG2 version, that crashes when compiling SPC. In that case I'm going to reassign the bug to Vlad for now for further investigation. If it turns out that the problem is indeed in Crossgen2 producing an incorrect PE layout on OSX, I'll be happy to get involved again. Sadly, I don't currently have a local OSX machine available for testing, so I guess I'd need to work with someone who does. |
No, the scenario is like:
|
I see, thanks Vlad for clarifying. Please let me know if there's anything I can help with on the Crossgen2 side. For now, the fact that this only fails when Crossgen2 is executed in single-exe mode makes me believe it's more likely related to the logic of loading assemblies from the bundled exe than to a general Crossgen2 error. After all, the Crossgen2-compiled framework assemblies and tests have been running in the lab on OSX-x64 for more than two years now without any crash like this. |
I wonder if this could be Debug/Release S.P.C/runtime mismatch, but I thought I guarded against that possibility with runtime/src/installer/pkg/sfx/Microsoft.NETCore.App/Microsoft.NETCore.App.Crossgen2.sfxproj Lines 26 to 30 in ff6af23
|
@agocke no, an S.P.C/runtime mismatch would not assert at layout time. It would cause random behavior or a crash later. @trylek yes, single-file is a factor. We had some issues with this on OSX in the past. We were just handling it silently with a fallback strategy of copying instead of mapping. The assembly would still load, but startup would be impacted, R2R disabled, etc. It is intentional that a failure to map causes asserts/failures now. Either way, what crossgen produces must match what the loader expects. One of the two will need to take a fix. |
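The change in behavior described above can be sketched as a policy switch. This is only an illustration of the idea, not the code in peimagelayout.cpp: `MappingError`, `load_image`, and the two callbacks are hypothetical names, and the returned dictionary is an invented stand-in for an image layout.

```python
class MappingError(Exception):
    """Raised when sections cannot be mapped at their expected addresses."""


def load_image(map_sections, copy_flat, strict):
    """Load an image, optionally falling back to a flat copy.

    map_sections and copy_flat are callables supplied by the caller.
    strict=False models the old silent fallback; strict=True models the
    current behavior, where a mapping failure surfaces as an error.
    """
    try:
        return {"layout": "mapped", "r2r_enabled": True, "data": map_sections()}
    except MappingError:
        if strict:
            raise
        # Old behavior: still loads, but without the precompiled code.
        return {"layout": "flat-copy", "r2r_enabled": False, "data": copy_flat()}


def failing_map():
    raise MappingError("sections overlap after page alignment")


# Old policy: the problem is hidden and only shows up as slower startup.
print(load_image(failing_map, lambda: b"IL only", strict=False)["layout"])

# New policy: the same input now fails loudly.
try:
    load_image(failing_map, lambda: b"IL only", strict=True)
except MappingError as error:
    print("load failed:", error)
```

The strict variant is noisier in CI, but it exposes a crossgen/loader mismatch that the fallback used to mask.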
Fixed by #68845 (not sure why this didn't auto close) |
Seen in runtime-dev-innerloop:
1) https://dev.azure.com/dnceng/public/_build/results?buildId=1741950&view=logs&j=9db4066d-6bf0-549a-7716-e181239d2ea7&t=ee7de8a6-b1ed-5700-3f81-b0654bf893a4
2) https://dev.azure.com/dnceng/public/_build/results?buildId=1742050&view=logs&j=9db4066d-6bf0-549a-7716-e181239d2ea7&t=ee7de8a6-b1ed-5700-3f81-b0654bf893a4