Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Assert failure: executionAborted in GcInfoDecoder::EnumerateLiveSlots #102370

Closed
jakobbotsch opened this issue May 17, 2024 · 14 comments · Fixed by #105551
Closed

Assert failure: executionAborted in GcInfoDecoder::EnumerateLiveSlots #102370

jakobbotsch opened this issue May 17, 2024 · 14 comments · Fixed by #105551
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI blocking-clean-ci-optional Blocking optional rolling runs Known Build Error Use this to report build issues in the .NET Helix tab
Milestone

Comments

@jakobbotsch
Copy link
Member

jakobbotsch commented May 17, 2024

Build Information

Build: https://dev.azure.com/dnceng-public/public/_build/results?buildId=678333&view=results
Build error leg or test failing:
Example console log: https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-102261-merge-e822cbb23a0f465186/LibraryImportGenerator.Unit.Tests/1/console.c3aa79dd.log?helixlogtype=result

/root/helix/work/workitem/e /root/helix/work/workitem/e
  Discovering: LibraryImportGenerator.Unit.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  LibraryImportGenerator.Unit.Tests (found 185 of 190 test cases)
  Starting:    LibraryImportGenerator.Unit.Tests (parallel test collections = on [2 threads], stop on fail = off)
    LibraryImportGenerator.UnitTests.Compiles.ValidateSnippetsWithMarshalType [SKIP]
      No current scenarios to test.

Assert failure(PID 28 [0x0000001c], Thread: 73 [0x0049]): executionAborted
    File: /__w/1/s/src/coreclr/vm/gcinfodecoder.cpp:801
    Image: /root/helix/work/correlation/dotnet

Maybe related to #101890?

Error Message

Fill the error message using step by step known issues guidance.

{
  "ErrorMessage": "executionAborted",
  "BuildRetry": false,
  "ExcludeConsoleLog": false
}

Known issue validation

Build: 🔎 https://dev.azure.com/dnceng-public/public/_build/results?buildId=678333
Error message validated: [executionAborted]
Result validation: ❌ Known issue did not match with the provided build.
Validation performed at: 5/17/2024 8:39:56 AM UTC

Report

Build Definition Test Pull Request
2503080 dotnet-runtime LibraryImportGenerator.Unit.Tests.WorkItemExecution
747517 dotnet/runtime System.Formats.Tar.Tests.WorkItemExecution #105099
2490036 dotnet-runtime System.Text.RegularExpressions.Tests.WorkItemExecution
2488046 dotnet-runtime LibraryImportGenerator.Unit.Tests.WorkItemExecution
2487307 dotnet-runtime LibraryImportGenerator.Unit.Tests.WorkItemExecution

Summary

24-Hour Hit Count 7-Day Hit Count 1-Month Count
1 1 5
@jakobbotsch jakobbotsch added blocking-clean-ci-optional Blocking optional rolling runs Known Build Error Use this to report build issues in the .NET Helix tab labels May 17, 2024
@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label May 17, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label May 17, 2024
@jakobbotsch
Copy link
Member Author

Also cc @VSadov, I'm not familiar with the logic here, but wonder if it could be related to the new GC safe points.

@jkotas jkotas added area-VM-coreclr and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels May 17, 2024
Copy link
Contributor

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

@VSadov
Copy link
Member

VSadov commented May 17, 2024

Also cc @VSadov, I'm not familiar with the logic here, but wonder if it could be related to the new GC safe points.

With interruptible GC safe points we can stress test each and every safe point.
In the past we would skip a good portion of safe points - JIT helpers, direct calls, for example would not be stress tested.

So - the change to safe points could be involved here, but it could also expose some existing bug.
(we did not see that when the change was merged, but there is a possibility of some nondeterministic bug)

Does this happen without the PR change?

@jakobbotsch
Copy link
Member Author

Does this happen without the PR change?

I don't know, I haven't investigated.
Happened again in the recent libraries-jitstress run:
Pipeline run: https://dev.azure.com/dnceng-public/public/_build/results?buildId=690710&view=results
Job name: net9.0-linux-Release-arm-disabler2r
Console log: https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-heads-main-705beed46fcf4015ad/LibraryImportGenerator.Unit.Tests/1/console.f4051e9b.log?helixlogtype=result

@VSadov
Copy link
Member

VSadov commented Jun 5, 2024

Always on arm32 and always in LibraryImportGenerator.Unit.Tests

The failure is strange though. It means that we try to initiate a stack walk in a fully-interruptible method, but the IP happens to not be in one of the interruptible ranges. It is hard to think of how this could happen as anything that leads to stack walks should at some point ask "is this IP interruptible?".

@VSadov
Copy link
Member

VSadov commented Jun 5, 2024

The last failure is interesting as it is not in a JIT stress run.

@jakobbotsch
Copy link
Member Author

The failure is strange though. It means that we try to initiate a stack walk in a fully-interruptible method, but the IP happens to not be in one of the interruptible ranges. It is hard to think of how this could happen as anything that leads to stack walks should at some point ask "is this IP interruptible?".

Here is 100% reliable repro on linux-x64: https://gist.github.com/jakobbotsch/b7b98e082d7be7a189d2edc23b375a89

I think the issue was exposed by #101761. It is similar in spirit to #102919, #103300, #104042, where the JIT is incorrectly handling the fact that we now have a managed helper call in a new location.
In this particular case the tailcall logic turns off GC before it starts placing arguments to the tailcall in the incoming parameters area (the GC ness of those args may not match the parameters, hence we need to do that). However, there is a block copy calling into the bulk write barrier helper after the JIT disables GC.
So I think the underlying problem is that the JIT turns off GC too early. We need similar reasoning as added by #103301.

I am looking into a fix.

@jakobbotsch jakobbotsch added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed area-VM-coreclr labels Jul 26, 2024
@jakobbotsch jakobbotsch assigned jakobbotsch and unassigned VSadov Jul 26, 2024
jakobbotsch added a commit to jakobbotsch/runtime that referenced this issue Jul 26, 2024
…opy with write barrier calls

When the JIT generates code for a tailcall it must generate code to
write the arguments into the incoming parameter area. Since the GC ness
of the arguments of the tailcall may not match the GC ness of the
parameters, we have to disable GC before we start writing these. This is
done by finding the earliest `GT_PUTARG_STK` node and placing the start
of the NOGC region right before it.

In addition, there is logic to take care of potential overlap between
the arguments and parameters. For example, if the call has an operand
that uses one of the parameters, then we must take care that we do not
override that parameter with the tailcall argument before the use of it.
To do so, we sometimes may need to introduce copies from the parameter
locals to locals on the stack frame.

This used to work fine, however, with dotnet#101761 we started transforming
block copies into managed calls in certain scenarios. It was possible
for the JIT to decide to introduce a copy to a local and for this
transformation to then kick in. This would cause us to end up with the
managed helper call after starting the nogc region. In checked builds
this would hit an assert during GC scan; in release builds, it would end
up with corrupted data.

The fix here is to make sure we insert the `GT_START_NOGC` after all the
potential temporary copies we may introduce as part of the tailcat stll
logic.

There was an additional assumption that the first `PUTARG_STK` operand
was the earliest one in execution order. That is not guaranteed, so this
change stops relying on that as well by introducing a new
`LIR::FirstNode` and using that to determine the earliest `PUTARG_STK`
node.

Fix dotnet#102370
Fix dotnet#104123
Fix dotnet#105441
@VSadov
Copy link
Member

VSadov commented Jul 26, 2024

@jakobbotsch - Also the first instruction of an epilog should not have GC forbidden. It was fixed in https://github.com/dotnet/runtime/pull/104336/files#r1664973906 , but right now reverted.

In theory that also could cause Assert failure: executionAborted - if we need to start stackwalk in a leaf method after a call and epilog starts right after that call as well.

@AndyAyersMS
Copy link
Member

@jakobbotsch - Also the first instruction of an epilog should not have GC forbidden. It was fixed in https://github.com/dotnet/runtime/pull/104336/files#r1664973906 , but right now reverted.

In theory that also could cause Assert failure: executionAborted - if we need to start stackwalk in a leaf method after a call and epilog starts right after that call as well.

Maybe you can PR that separately, if the fix for the reversion will take a while?

github-actions bot pushed a commit that referenced this issue Jul 26, 2024
…opy with write barrier calls

When the JIT generates code for a tailcall it must generate code to
write the arguments into the incoming parameter area. Since the GC ness
of the arguments of the tailcall may not match the GC ness of the
parameters, we have to disable GC before we start writing these. This is
done by finding the earliest `GT_PUTARG_STK` node and placing the start
of the NOGC region right before it.

In addition, there is logic to take care of potential overlap between
the arguments and parameters. For example, if the call has an operand
that uses one of the parameters, then we must take care that we do not
override that parameter with the tailcall argument before the use of it.
To do so, we sometimes may need to introduce copies from the parameter
locals to locals on the stack frame.

This used to work fine, however, with #101761 we started transforming
block copies into managed calls in certain scenarios. It was possible
for the JIT to decide to introduce a copy to a local and for this
transformation to then kick in. This would cause us to end up with the
managed helper call after starting the nogc region. In checked builds
this would hit an assert during GC scan; in release builds, it would end
up with corrupted data.

The fix here is to make sure we insert the `GT_START_NOGC` after all the
potential temporary copies we may introduce as part of the tailcat stll
logic.

There was an additional assumption that the first `PUTARG_STK` operand
was the earliest one in execution order. That is not guaranteed, so this
change stops relying on that as well by introducing a new
`LIR::FirstNode` and using that to determine the earliest `PUTARG_STK`
node.

Fix #102370
Fix #104123
Fix #105441
@VSadov
Copy link
Member

VSadov commented Jul 26, 2024

Maybe you can PR that separately, if the fix for the reversion will take a while?

I am considering that, but need to be sure it was not the part of the PR that made JIT stress unhappy.

hoyosjs pushed a commit that referenced this issue Jul 26, 2024
…calls in face of bulk copy with write barrier calls (#105572)

* JIT: Fix placement of `GT_START_NOGC` for tailcalls in face of bulk copy with write barrier calls

When the JIT generates code for a tailcall it must generate code to
write the arguments into the incoming parameter area. Since the GC ness
of the arguments of the tailcall may not match the GC ness of the
parameters, we have to disable GC before we start writing these. This is
done by finding the earliest `GT_PUTARG_STK` node and placing the start
of the NOGC region right before it.

In addition, there is logic to take care of potential overlap between
the arguments and parameters. For example, if the call has an operand
that uses one of the parameters, then we must take care that we do not
override that parameter with the tailcall argument before the use of it.
To do so, we sometimes may need to introduce copies from the parameter
locals to locals on the stack frame.

This used to work fine, however, with #101761 we started transforming
block copies into managed calls in certain scenarios. It was possible
for the JIT to decide to introduce a copy to a local and for this
transformation to then kick in. This would cause us to end up with the
managed helper call after starting the nogc region. In checked builds
this would hit an assert during GC scan; in release builds, it would end
up with corrupted data.

The fix here is to make sure we insert the `GT_START_NOGC` after all the
potential temporary copies we may introduce as part of the tailcat stll
logic.

There was an additional assumption that the first `PUTARG_STK` operand
was the earliest one in execution order. That is not guaranteed, so this
change stops relying on that as well by introducing a new
`LIR::FirstNode` and using that to determine the earliest `PUTARG_STK`
node.

Fix #102370
Fix #104123
Fix #105441
---------

Co-authored-by: Jakob Botsch Nielsen <jakob.botsch.nielsen@gmail.com>
@VSadov
Copy link
Member

VSadov commented Jul 29, 2024

The partial re-revert, that contains only JIT fixes has some GC stress failures (extra ones compared to a no-op GC stress off main). Suspiciously the failures are all on arm64 and with JITStress.

So, it is possible that it is the JIT changes for the No-GC range in the epilog that make JIT stress unhappy.

Maybe the "fix" was incomplete and there is one more thing, perhaps in in JIT stress itself, that indirectly relies on old behavior.

@VSadov
Copy link
Member

VSadov commented Jul 30, 2024

I am considering that, but need to be sure it was not the part of the PR that made JIT stress unhappy.

Turns out it was the "fix" that exposed bad GC info on the instruction right after CORINFO_HELP_FAIL_FAST. It was always possible to have it incorrect, but as long as we treated the first instruction of epilog as uninterruptible, it was mitigating the issue.

Treating the first instruction of epilog as uninterruptible is not the right thing to do, in general, especially after calls that could return or could do GC.
CORINFO_HELP_FAIL_FAST is fairly special in that sense - it does not return and does not do GC, but also the way we emit this call very late for the GS cookie check often results in corrupted GC info on the next instruction.

The part that CORINFO_HELP_FAIL_FAST is a NoGC helper saves us in partially interruptible code since emitting the call does not result in a GC safe point. In Fully interruptible code, the bad GC info is observable (although only on a single instruction), thus we only see problems in rare GC-JIT stress combinations.

I have a few solutions in mind, but need to figure what actually works more naturally.
Re: #105596 (comment)

@github-actions github-actions bot locked and limited conversation to collaborators Aug 30, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI blocking-clean-ci-optional Blocking optional rolling runs Known Build Error Use this to report build issues in the .NET Helix tab
Projects
None yet
6 participants