Occasionally hitting error MSB6006: "csc.dll" exited with code 139 on linux (w/ GCDynamicAdaptationMode=1) #104123
Comments
Yes, that would be useful. Are you able to extract the stack trace from the core dumps? It would help with routing this issue. (https://learn.microsoft.com/en-us/troubleshoot/developer/webapps/aspnetcore/practice-troubleshoot-linux/lab-1-2-analyze-core-dumps-lldb-debugger has the steps.)
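For reference, a minimal sketch of pulling the managed stacks out of one of those core dumps with lldb and the SOS extension (the dump name and host path below are placeholders):

```sh
# one-time setup: install the SOS extension for lldb
dotnet tool install --global dotnet-sos
dotnet-sos install

# open the dump against the dotnet host that produced it
lldb --core ./core.1234 ./dotnet

# then, at the (lldb) prompt:
#   clrstack -all     # managed stacks for every thread
#   verifyheap        # checks the GC heap for corruption
```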
Crash during GC at:
The GC heap is corrupted:
@LoopedBard3 Could you please let us know whether you still see it crashing after picking up a build that includes #103301?
Yup, will watch to see if the update fixes the issue 👍.
Looking at one of the recent failing runs, #103301 does not seem to have fixed the issue. The SDK version used in this recent build that still hit the failure was at commit dotnet/sdk@e18cfb7 and had a Microsoft.NETCore.App.Ref commit of a900bbf (from Version.Details.xml#L19-L20). If there is a different version/link I should be looking at to make sure we have the update, let me know.
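A quick way to confirm which runtime commit a given SDK commit pins (assuming the dependency name and file layout in dotnet/sdk are unchanged) is to read its eng/Version.Details.xml directly, e.g.:

```sh
# show the Microsoft.NETCore.App.Ref dependency (name, version, repo, sha) pinned by that SDK commit
# (short SHA taken from the comment above; substitute the full SHA if the short one doesn't resolve)
curl -s https://raw.githubusercontent.com/dotnet/sdk/e18cfb7/eng/Version.Details.xml \
  | grep -A 3 'Microsoft.NETCore.App.Ref'
```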
Would it be possible to set
This error is hit in various superpmi collect test legs for Linux.
I ran a test building both with and without the env var set here: https://dev.azure.com/dnceng/internal/_build/results?buildId=2487207&view=results; the jobs with gcdynamicadaptationmodeoff in the name have DOTNET_GCDynamicAdaptationMode=0 set. This is only one test, but turning off GCDynamicAdaptationMode does seem to have fixed the issue: all three jobs with the env var set succeeded while the other three failed.
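For anyone needing the workaround locally, disabling DATAS for a build invocation is just a matter of exporting the variable first (the build command below is illustrative; substitute whatever your job actually runs):

```sh
# turn off dynamic heap count adaptation (DATAS) for this shell and its children
export DOTNET_GCDynamicAdaptationMode=0
dotnet build ./src/benchmarks/micro/MicroBenchmarks.csproj --configuration Release --framework net9.0
```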
@dotnet/gc Could you please take a look, given that this crash does not repro with DATAS disabled? Note that these builds run on machines with many cores, which may explain why we do not see more instances of this crash. I have looked at a number of the crash dumps. The only common pattern that I have observed was that DATAS scaled up the number of GC heaps multiple times, but the nature of the GC heap corruption was very different each time.
Yeah, we will take a look. @LoopedBard3, does this repro locally for you or only on the build machines?
I have not tried locally yet as I don't have a personal dedicated Linux box, but I can see if I can get it to repro manually on one of the pipeline runners, as they are actual machines.
@mrsharm I reproed it using your instructions and script (thanks!) and looked at the dump. I think it is the same issue -- I see a call from
I have not been able to repro it with these yet. Sadly
@jakobbotsch
…calls in face of bulk copy with write barrier calls (#105572)

JIT: Fix placement of `GT_START_NOGC` for tailcalls in face of bulk copy with write barrier calls

When the JIT generates code for a tailcall it must generate code to write the arguments into the incoming parameter area. Since the GC-ness of the arguments of the tailcall may not match the GC-ness of the parameters, we have to disable GC before we start writing these. This is done by finding the earliest `GT_PUTARG_STK` node and placing the start of the NOGC region right before it. In addition, there is logic to take care of potential overlap between the arguments and parameters. For example, if the call has an operand that uses one of the parameters, then we must take care that we do not overwrite that parameter with the tailcall argument before the use of it. To do so, we sometimes may need to introduce copies from the parameter locals to locals on the stack frame.

This used to work fine; however, with #101761 we started transforming block copies into managed calls in certain scenarios. It was possible for the JIT to decide to introduce a copy to a local and for this transformation to then kick in. This would cause us to end up with the managed helper call after starting the nogc region. In checked builds this would hit an assert during GC scan; in release builds, it would end up with corrupted data.

The fix here is to make sure we insert the `GT_START_NOGC` after all the potential temporary copies we may introduce as part of the tailcall logic. There was an additional assumption that the first `PUTARG_STK` operand was the earliest one in execution order. That is not guaranteed, so this change stops relying on that as well by introducing a new `LIR::FirstNode` and using that to determine the earliest `PUTARG_STK` node.

Fixes #102370, #104123, #105441

Co-authored-by: Jakob Botsch Nielsen <jakob.botsch.nielsen@gmail.com>
As a heads up, I have also been running the repro with both:
for the past 2-3 hours, and so far we haven't observed a repro.
I am not entirely sure what the result is in release builds when the VM comes across the IP inside the nogc region (@VSadov can tell us) -- however, my understanding is that this results in generic GC hole-like behavior, with certain GC references not being updated during relocation. If so, then it can definitely result in what you say and more general arbitrary heap corruption.
That is correct. Incorrect root reporting will likely lead to heap corruptions. Something may get collected while still reachable, something may get moved and not updated, ... and sometimes you may get lucky and nothing wrong will happen. In a few cases in checked builds this will cause asserts in
In this particular case it's not that the GC reporting from this location is problematic or that the GC information is wrong, so I think the VM must be skipping the reporting entirely in release builds when it comes across this situation. Otherwise I think we wouldn't have seen the issue here. The fix does not change any GC information reporting, it just starts the nogc region from a later location.
Would it make sense for the runtime to handle this situation (non-leaf frame at a non-gc safe point during GC) as a fatal error?
Only fully interruptible methods keep the interruptibility info. And this is a case that we can detect and currently assert.
@LoopedBard3 - feel free to change the tags and assignees, but this doesn't seem like a GC-related issue.
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Sounds good, it seems the error is no longer, or at least far less often, occurring in our pipeline. A new error with the same code does seem to be happening in our wasm runs though:
The error seems to have last been a primary problem in this build: 20240727.1, with the following build no longer hitting the issue: 20240727.2. The compare range for these builds is 7e429c2...dc7d7bc, but that doesn't have any obvious runtime change that would explain the fix. The dotnet SDK versions for each run were:
fixed:
Just encountered this while source-building preview7 after rebootstrapping to consume the fix in #105572. The failure occurred in this build:
FWIW the error more closely aligns with what's described in #105441. cc @dotnet/source-build-internal
@ellahathaway looks like the last several builds of main have been fine, and the error you saw was on arm64, so guessing this might have been fixed by #105832. In those builds, is crossgen2 being run with a live-built dotnet (or something relatively close)? @jakobbotsch if the above is true it might give us a bit of reassurance that the fix from #105832 worked, though I don't know how often this particular crossgen2 error surfaced in the weeks before, so it might not...
I think we have good confidence that this issue is fixed, but I'm going to add blocking-release to this and keep it open and in .NET 9 for tracking purposes until we pick up the new SDK.
We just hit this error in https://dev.azure.com/dnceng-public/public/_build/results?buildId=774394&view=logs&j=5ac7b393-e840-5549-7fb4-a4479af8e7e3&t=29df2fa2-0d20-51bd-e85a-8b546e86c529
But it looks like we are using an older SDK (9.0.100-preview.7.24371.4). I'm not sure if the fix came in after that, but I'm logging the instance here for tracking purposes.
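When triaging instances like this, a quick way to tell which runtime bits a failing job actually used is to ask the SDK install that ran the build (a sketch; the path below is the install dir from the repro steps and should be adjusted to the pipeline's layout):

```sh
# show the SDK version plus the runtime versions/commit hashes it carries
./performance/tools/dotnet/x64/dotnet --info
./performance/tools/dotnet/x64/dotnet --list-runtimes
```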
dotnet/source-build#4576 - I suspect that SB just encountered this error in one of our 9.0 builds |
Description
In the dotnet-runtime-perf pipeline, we are seeing multiple Linux jobs hitting the error
dotnet/x64/sdk/9.0.100-preview.7.24323.5/Roslyn/Microsoft.CSharp.Core.targets(85,5): error MSB6006: "csc.dll" exited with code 139.
when building our MicroBenchmarks.csproj file for BDN testing. This is occurring on 0-3 of the 30 helix workitems we send out for each job, with no consistency in which of the 30 workitems is affected or which agent machine hits the error. I'm pretty sure I have a core dump from some of these failed runs if that would be useful. Potentially related to: #57558
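For context on the exit code: 139 on Linux is 128 + 11, i.e. the csc process was terminated by SIGSEGV, which lines up with the crash/heap-corruption discussion above.

```sh
# exit code 139 = 128 + signal 11
kill -l 11    # prints: SEGV
```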
Reproduction Steps
I need to test more, but this should work for reproing, though as mentioned in the description, hitting the error is not consistent.
Steps (high level):
python3 ./scripts/benchmarks_ci.py --csproj ./src/benchmarks/micro/MicroBenchmarks.csproj --incremental no --architecture x64 -f net9.0 --dotnet-versions 9.0.100-preview.6.24320.9 --bdn-arguments="--anyCategories Libraries Runtime --logBuildOutput --generateBinLog --partition-count 30 --partition-index 29"
Steps (inner command, this should match but ping if this seems to be missing a step):
dotnet-install.sh -InstallDir ./performance/tools/dotnet/x64 -Architecture x64 -Version 9.0.100-preview.6.24320.9
dotnet run --project ./src/benchmarks/micro/MicroBenchmarks.csproj --configuration Release --framework net9.0 --no-restore --no-build -- --anyCategories Libraries Runtime "" --logBuildOutput --generateBinLog --partition-count 30 --partition-index 29 --artifacts ./artifacts/BenchmarkDotNet.Artifacts --packages ./artifacts/packages --buildTimeout 1200
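Since the crash only shows up on a small fraction of workitems, a brute-force retry loop over the build step can help when trying to reproduce it on a single machine (a rough sketch; paths and iteration count are placeholders, and the real runs go through benchmarks_ci.py as above):

```sh
# repeatedly rebuild MicroBenchmarks; the MSB6006 / exit code 139 failure is intermittent
for i in $(seq 1 50); do
  echo "=== iteration $i ==="
  if ! dotnet build ./src/benchmarks/micro/MicroBenchmarks.csproj --configuration Release --framework net9.0; then
    echo "build failed on iteration $i -- check the log above for MSB6006"
    break
  fi
done
```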
Expected behavior
Build is successful and continues to run the BenchmarkDotNet tests.
Actual behavior
The build fails with the MSB6006 error shown above.
Full logs from an example run with the error are available: dotnet-runtime-perf Run 20240620.3. The specific partitions are Partition 2 and Partition 6 from the job 'Performance linux x64 release coreclr JIT micro perfowl NoJS False False False net9.0'.
Regression?
This started occurring between our runs dotnet-runtime-perf Run 20240620.2 and dotnet-runtime-perf Run 20240620.3.
The runtime repo comparison between these two jobs is 4a7fe65...b0c4728.
Our performance repo also took one update but it seems highly unlikely to be related: dotnet/performance#4279.
Version difference information available in the information section below.
Known Workarounds
None
Configuration
.NET Version information:
Information from first run with error dotnet-runtime-perf Run 20240620.3:
Information from run before error dotnet-runtime-perf Run 20240620.2:
This is happening across multiple different machine hardware configurations.
Other information
No response