Assert failure: executionAborted in GcInfoDecoder::EnumerateLiveSlots
#102370
Comments
Also cc @VSadov, I'm not familiar with the logic here, but wonder if it could be related to the new GC safe points.
Tagging subscribers to this area: @mangod9
With interruptible GC safe points we can stress test each and every safe point. So the change to safe points could be involved here, but it could also have exposed an existing bug. Does this happen without the PR change?
I don't know, I haven't investigated.
Always on arm32 and always in
The failure is strange though. It means that we try to initiate a stack walk in a fully-interruptible method, but the IP happens to not be in one of the interruptible ranges. It is hard to think of how this could happen, as anything that leads to stack walks should at some point ask "is this IP interruptible?".
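For readers unfamiliar with the check being described, here is a minimal sketch of an "is this IP interruptible?" query. It is not the runtime's `GcInfoDecoder` code: the function names, the list-of-tuples range representation, and the half-open `[start, end)` convention are all assumptions made for illustration.

```python
from bisect import bisect_right

def is_interruptible(offset, ranges):
    """Return True if `offset` lies inside one of the interruptible ranges.

    `ranges` is a hypothetical representation: a sorted list of
    non-overlapping, half-open (start, end) code offsets.
    """
    starts = [start for start, _ in ranges]
    # Find the last range whose start is <= offset.
    i = bisect_right(starts, offset) - 1
    return i >= 0 and offset < ranges[i][1]

def begin_stack_walk(offset, ranges):
    """Toy stand-in for a stack walk that refuses non-interruptible IPs."""
    if not is_interruptible(offset, ranges):
        # The failure in this issue corresponds to reaching a walk with an
        # IP that fails this kind of check.
        raise ValueError(f"offset {offset} is not in an interruptible range")
    return "walking"
```

In this model, a caller that asks `is_interruptible` before suspending a thread should never reach the `ValueError`, which is why the failure described above is surprising.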
The last failure is interesting as it is not in a JIT stress run.
Here is a 100% reliable repro on linux-x64: https://gist.github.com/jakobbotsch/b7b98e082d7be7a189d2edc23b375a89 I think the issue was exposed by #101761. It is similar in spirit to #102919, #103300, #104042, where the JIT is incorrectly handling the fact that we now have a managed helper call in a new location. I am looking into a fix.
…opy with write barrier calls

When the JIT generates code for a tailcall it must generate code to write the arguments into the incoming parameter area. Since the GC-ness of the arguments of the tailcall may not match the GC-ness of the parameters, we have to disable GC before we start writing these. This is done by finding the earliest `GT_PUTARG_STK` node and placing the start of the NOGC region right before it.

In addition, there is logic to take care of potential overlap between the arguments and parameters. For example, if the call has an operand that uses one of the parameters, then we must take care that we do not overwrite that parameter with the tailcall argument before the use of it. To do so, we sometimes may need to introduce copies from the parameter locals to locals on the stack frame.

This used to work fine; however, with dotnet#101761 we started transforming block copies into managed calls in certain scenarios. It was possible for the JIT to decide to introduce a copy to a local and for this transformation to then kick in. This would cause us to end up with the managed helper call after starting the NOGC region. In checked builds this would hit an assert during GC scan; in release builds, it would end up with corrupted data.

The fix here is to make sure we insert the `GT_START_NOGC` after all the potential temporary copies we may introduce as part of the tailcall logic.

There was an additional assumption that the first `PUTARG_STK` operand was the earliest one in execution order. That is not guaranteed, so this change stops relying on that as well by introducing a new `LIR::FirstNode` and using that to determine the earliest `PUTARG_STK` node.

Fix dotnet#102370
Fix dotnet#104123
Fix dotnet#105441
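The ordering constraint in the commit message can be illustrated with a toy model. This is only a sketch and not the JIT's implementation: the linear IR is modelled as a plain list of node names, and `first_node`, `place_start_nogc`, and `calls_in_nogc_region` are hypothetical helpers (the real `LIR::FirstNode` is C++ and its signature is not shown in this thread).

```python
def first_node(order, a, b):
    """Return whichever of `a` and `b` comes first in execution order.

    `order` is the toy linear IR: a list of unique node names.
    """
    return a if order.index(a) < order.index(b) else b

def place_start_nogc(nodes, putarg_operands, temp_copies):
    """Insert temporary copies and START_NOGC ahead of the earliest PUTARG_STK.

    `putarg_operands` is given in call-operand order, which is not
    necessarily execution order, so the earliest one is computed explicitly.
    """
    earliest = putarg_operands[0]
    for op in putarg_operands[1:]:
        earliest = first_node(nodes, earliest, op)
    pos = nodes.index(earliest)
    # Copies go first: they may later turn into helper calls, which must not
    # end up inside the NOGC region.
    return nodes[:pos] + list(temp_copies) + ["START_NOGC"] + nodes[pos:]

def calls_in_nogc_region(nodes, may_call):
    """List nodes between START_NOGC and the final tailcall that may call."""
    start = nodes.index("START_NOGC")
    return [n for n in nodes[start + 1:-1] if may_call(n)]
```

With `nodes = ["putarg_b", "putarg_a", "tailcall"]` and operands listed as `["putarg_a", "putarg_b"]`, the sketch picks `putarg_b` as the earliest node and places the copy before `START_NOGC`, so no potential helper call lands inside the region.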
@jakobbotsch - Also the first instruction of an epilog should not have GC forbidden. It was fixed in https://github.com/dotnet/runtime/pull/104336/files#r1664973906, but that change is currently reverted. In theory that also could cause
Maybe you can PR that separately, if the fix for the reversion will take a while?
I am considering that, but need to be sure it was not the part of the PR that made JIT stress unhappy.
…calls in face of bulk copy with write barrier calls (#105572)

* JIT: Fix placement of `GT_START_NOGC` for tailcalls in face of bulk copy with write barrier calls

When the JIT generates code for a tailcall it must generate code to write the arguments into the incoming parameter area. Since the GC-ness of the arguments of the tailcall may not match the GC-ness of the parameters, we have to disable GC before we start writing these. This is done by finding the earliest `GT_PUTARG_STK` node and placing the start of the NOGC region right before it.

In addition, there is logic to take care of potential overlap between the arguments and parameters. For example, if the call has an operand that uses one of the parameters, then we must take care that we do not overwrite that parameter with the tailcall argument before the use of it. To do so, we sometimes may need to introduce copies from the parameter locals to locals on the stack frame.

This used to work fine; however, with #101761 we started transforming block copies into managed calls in certain scenarios. It was possible for the JIT to decide to introduce a copy to a local and for this transformation to then kick in. This would cause us to end up with the managed helper call after starting the NOGC region. In checked builds this would hit an assert during GC scan; in release builds, it would end up with corrupted data.

The fix here is to make sure we insert the `GT_START_NOGC` after all the potential temporary copies we may introduce as part of the tailcall logic.

There was an additional assumption that the first `PUTARG_STK` operand was the earliest one in execution order. That is not guaranteed, so this change stops relying on that as well by introducing a new `LIR::FirstNode` and using that to determine the earliest `PUTARG_STK` node.

Fix #102370
Fix #104123
Fix #105441

---------

Co-authored-by: Jakob Botsch Nielsen <jakob.botsch.nielsen@gmail.com>
The partial re-revert, which contains only the JIT fixes, has some GC stress failures (extra ones compared to a no-op GC stress run off main). Suspiciously, the failures are all on arm64 and with JITStress. So it is possible that it is the JIT changes for the No-GC range in the epilog that make JIT stress unhappy. Maybe the "fix" was incomplete and there is one more thing, perhaps in JIT stress itself, that indirectly relies on the old behavior.
Turns out it was the "fix" that exposed bad GC info on the instruction right after Treating the first instruction of epilog as uninterruptible is not the right thing to do, in general, especially after calls that could return or could do GC. The part that I have a few solutions in mind, but need to figure what actually works more naturally.
Build Information
Build: https://dev.azure.com/dnceng-public/public/_build/results?buildId=678333&view=results
Build error leg or test failing:
Example console log: https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-102261-merge-e822cbb23a0f465186/LibraryImportGenerator.Unit.Tests/1/console.c3aa79dd.log?helixlogtype=result
Maybe related to #101890?
Error Message
Fill the error message using step by step known issues guidance.
Known issue validation
Build: 🔎 https://dev.azure.com/dnceng-public/public/_build/results?buildId=678333
Error message validated: [executionAborted]
Result validation: ❌ Known issue did not match with the provided build.
Validation performed at: 5/17/2024 8:39:56 AM UTC