Assert failure: executionAborted in `GcInfoDecoder::EnumerateLiveSlots` #102370

jakobbotsch · 2024-05-17T08:39:01Z

Build Information

Build: https://dev.azure.com/dnceng-public/public/_build/results?buildId=678333&view=results
Build error leg or test failing:
Example console log: https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-102261-merge-e822cbb23a0f465186/LibraryImportGenerator.Unit.Tests/1/console.c3aa79dd.log?helixlogtype=result

/root/helix/work/workitem/e /root/helix/work/workitem/e
  Discovering: LibraryImportGenerator.Unit.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  LibraryImportGenerator.Unit.Tests (found 185 of 190 test cases)
  Starting:    LibraryImportGenerator.Unit.Tests (parallel test collections = on [2 threads], stop on fail = off)
    LibraryImportGenerator.UnitTests.Compiles.ValidateSnippetsWithMarshalType [SKIP]
      No current scenarios to test.

Assert failure(PID 28 [0x0000001c], Thread: 73 [0x0049]): executionAborted
    File: /__w/1/s/src/coreclr/vm/gcinfodecoder.cpp:801
    Image: /root/helix/work/correlation/dotnet

Maybe related to #101890?

Error Message

Fill the error message using step by step known issues guidance.

{
  "ErrorMessage": "executionAborted",
  "BuildRetry": false,
  "ExcludeConsoleLog": false
}

Known issue validation

Build: 🔎 https://dev.azure.com/dnceng-public/public/_build/results?buildId=678333
Error message validated: [executionAborted]
Result validation: ❌ Known issue did not match with the provided build.
Validation performed at: 5/17/2024 8:39:56 AM UTC

Report

Build	Definition	Test	Pull Request
2503080	dotnet-runtime	LibraryImportGenerator.Unit.Tests.WorkItemExecution
747517	dotnet/runtime	System.Formats.Tar.Tests.WorkItemExecution	#105099
2490036	dotnet-runtime	System.Text.RegularExpressions.Tests.WorkItemExecution
2488046	dotnet-runtime	LibraryImportGenerator.Unit.Tests.WorkItemExecution
2487307	dotnet-runtime	LibraryImportGenerator.Unit.Tests.WorkItemExecution

Summary

24-Hour Hit Count	7-Day Hit Count	1-Month Count
1	1	5

The text was updated successfully, but these errors were encountered:

jakobbotsch · 2024-05-17T08:42:19Z

Also cc @VSadov, I'm not familiar with the logic here, but wonder if it could be related to the new GC safe points.

dotnet-policy-service · 2024-05-17T15:05:14Z

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

VSadov · 2024-05-17T17:34:30Z

Also cc @VSadov, I'm not familiar with the logic here, but wonder if it could be related to the new GC safe points.

With interruptible GC safe points we can stress test each and every safe point.
In the past we would skip a good portion of safe points - JIT helpers, direct calls, for example would not be stress tested.

So - the change to safe points could be involved here, but it could also expose some existing bug.
(we did not see that when the change was merged, but there is a possibility of some nondeterministic bug)

Does this happen without the PR change?

jakobbotsch · 2024-05-29T20:19:46Z

Does this happen without the PR change?

I don't know, I haven't investigated.
Happened again in the recent libraries-jitstress run:
Pipeline run: https://dev.azure.com/dnceng-public/public/_build/results?buildId=690710&view=results
Job name: net9.0-linux-Release-arm-disabler2r
Console log: https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-heads-main-705beed46fcf4015ad/LibraryImportGenerator.Unit.Tests/1/console.f4051e9b.log?helixlogtype=result

jakobbotsch · 2024-05-30T17:24:42Z

Saw one on linux-x64 as well in https://dev.azure.com/dnceng-public/public/_build/results?buildId=692730&view=results.
Console log: https://helixre107v0xdcypoyl9e7f.blob.core.windows.net/dotnet-runtime-refs-pull-102808-merge-ca406cfc9a8448aea7/LibraryImportGenerator.Unit.Tests/1/console.b56c6cd1.log?helixlogtype=result

davidwrighton · 2024-06-05T21:09:11Z

Happened again on linux-arm https://dev.azure.com/dnceng-public/public/_build/results?buildId=698579&view=ms.vss-test-web.build-test-results-tab&runId=17382094&resultId=208376&paneView=dotnet-dnceng.dnceng-build-release-tasks.helix-test-information-tab

Console log
https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-101580-merge-24599464203f4fedbf/LibraryImportGenerator.Unit.Tests/1/console.4e5f55dd.log?helixlogtype=result

VSadov · 2024-06-05T21:18:40Z

Always on arm32 and always in LibraryImportGenerator.Unit.Tests

The failure is strange though. It means that we try to initiate a stack walk in a fully-interruptible method, but the IP happens to not be in one of the interruptible ranges. It is hard to think of how this could happen as anything that leads to stack walks should at some point ask "is this IP interruptible?".

VSadov · 2024-06-05T21:21:36Z

The last failure is interesting as it is not in a JIT stress run.

jakobbotsch · 2024-07-26T09:41:31Z

The failure is strange though. It means that we try to initiate a stack walk in a fully-interruptible method, but the IP happens to not be in one of the interruptible ranges. It is hard to think of how this could happen as anything that leads to stack walks should at some point ask "is this IP interruptible?".

Here is 100% reliable repro on linux-x64: https://gist.github.com/jakobbotsch/b7b98e082d7be7a189d2edc23b375a89

I think the issue was exposed by #101761. It is similar in spirit to #102919, #103300, #104042, where the JIT is incorrectly handling the fact that we now have a managed helper call in a new location.
In this particular case the tailcall logic turns off GC before it starts placing arguments to the tailcall in the incoming parameters area (the GC ness of those args may not match the parameters, hence we need to do that). However, there is a block copy calling into the bulk write barrier helper after the JIT disables GC.
So I think the underlying problem is that the JIT turns off GC too early. We need similar reasoning as added by #103301.

I am looking into a fix.

…opy with write barrier calls When the JIT generates code for a tailcall it must generate code to write the arguments into the incoming parameter area. Since the GC ness of the arguments of the tailcall may not match the GC ness of the parameters, we have to disable GC before we start writing these. This is done by finding the earliest `GT_PUTARG_STK` node and placing the start of the NOGC region right before it. In addition, there is logic to take care of potential overlap between the arguments and parameters. For example, if the call has an operand that uses one of the parameters, then we must take care that we do not override that parameter with the tailcall argument before the use of it. To do so, we sometimes may need to introduce copies from the parameter locals to locals on the stack frame. This used to work fine, however, with dotnet#101761 we started transforming block copies into managed calls in certain scenarios. It was possible for the JIT to decide to introduce a copy to a local and for this transformation to then kick in. This would cause us to end up with the managed helper call after starting the nogc region. In checked builds this would hit an assert during GC scan; in release builds, it would end up with corrupted data. The fix here is to make sure we insert the `GT_START_NOGC` after all the potential temporary copies we may introduce as part of the tailcat stll logic. There was an additional assumption that the first `PUTARG_STK` operand was the earliest one in execution order. That is not guaranteed, so this change stops relying on that as well by introducing a new `LIR::FirstNode` and using that to determine the earliest `PUTARG_STK` node. Fix dotnet#102370 Fix dotnet#104123 Fix dotnet#105441

VSadov · 2024-07-26T14:03:18Z

@jakobbotsch - Also the first instruction of an epilog should not have GC forbidden. It was fixed in https://github.com/dotnet/runtime/pull/104336/files#r1664973906 , but right now reverted.

In theory that also could cause Assert failure: executionAborted - if we need to start stackwalk in a leaf method after a call and epilog starts right after that call as well.

AndyAyersMS · 2024-07-26T15:32:46Z

@jakobbotsch - Also the first instruction of an epilog should not have GC forbidden. It was fixed in https://github.com/dotnet/runtime/pull/104336/files#r1664973906 , but right now reverted.

In theory that also could cause Assert failure: executionAborted - if we need to start stackwalk in a leaf method after a call and epilog starts right after that call as well.

Maybe you can PR that separately, if the fix for the reversion will take a while?

…opy with write barrier calls When the JIT generates code for a tailcall it must generate code to write the arguments into the incoming parameter area. Since the GC ness of the arguments of the tailcall may not match the GC ness of the parameters, we have to disable GC before we start writing these. This is done by finding the earliest `GT_PUTARG_STK` node and placing the start of the NOGC region right before it. In addition, there is logic to take care of potential overlap between the arguments and parameters. For example, if the call has an operand that uses one of the parameters, then we must take care that we do not override that parameter with the tailcall argument before the use of it. To do so, we sometimes may need to introduce copies from the parameter locals to locals on the stack frame. This used to work fine, however, with #101761 we started transforming block copies into managed calls in certain scenarios. It was possible for the JIT to decide to introduce a copy to a local and for this transformation to then kick in. This would cause us to end up with the managed helper call after starting the nogc region. In checked builds this would hit an assert during GC scan; in release builds, it would end up with corrupted data. The fix here is to make sure we insert the `GT_START_NOGC` after all the potential temporary copies we may introduce as part of the tailcat stll logic. There was an additional assumption that the first `PUTARG_STK` operand was the earliest one in execution order. That is not guaranteed, so this change stops relying on that as well by introducing a new `LIR::FirstNode` and using that to determine the earliest `PUTARG_STK` node. Fix #102370 Fix #104123 Fix #105441

VSadov · 2024-07-26T18:49:39Z

Maybe you can PR that separately, if the fix for the reversion will take a while?

I am considering that, but need to be sure it was not the part of the PR that made JIT stress unhappy.

…calls in face of bulk copy with write barrier calls (#105572) * JIT: Fix placement of `GT_START_NOGC` for tailcalls in face of bulk copy with write barrier calls When the JIT generates code for a tailcall it must generate code to write the arguments into the incoming parameter area. Since the GC ness of the arguments of the tailcall may not match the GC ness of the parameters, we have to disable GC before we start writing these. This is done by finding the earliest `GT_PUTARG_STK` node and placing the start of the NOGC region right before it. In addition, there is logic to take care of potential overlap between the arguments and parameters. For example, if the call has an operand that uses one of the parameters, then we must take care that we do not override that parameter with the tailcall argument before the use of it. To do so, we sometimes may need to introduce copies from the parameter locals to locals on the stack frame. This used to work fine, however, with #101761 we started transforming block copies into managed calls in certain scenarios. It was possible for the JIT to decide to introduce a copy to a local and for this transformation to then kick in. This would cause us to end up with the managed helper call after starting the nogc region. In checked builds this would hit an assert during GC scan; in release builds, it would end up with corrupted data. The fix here is to make sure we insert the `GT_START_NOGC` after all the potential temporary copies we may introduce as part of the tailcat stll logic. There was an additional assumption that the first `PUTARG_STK` operand was the earliest one in execution order. That is not guaranteed, so this change stops relying on that as well by introducing a new `LIR::FirstNode` and using that to determine the earliest `PUTARG_STK` node. Fix #102370 Fix #104123 Fix #105441 --------- Co-authored-by: Jakob Botsch Nielsen <jakob.botsch.nielsen@gmail.com>

VSadov · 2024-07-29T18:20:50Z

The partial re-revert, that contains only JIT fixes has some GC stress failures (extra ones compared to a no-op GC stress off main). Suspiciously the failures are all on arm64 and with JITStress.

So, it is possible that it is the JIT changes for the No-GC range in the epilog that make JIT stress unhappy.

Maybe the "fix" was incomplete and there is one more thing, perhaps in in JIT stress itself, that indirectly relies on old behavior.

VSadov · 2024-07-30T19:17:33Z

I am considering that, but need to be sure it was not the part of the PR that made JIT stress unhappy.

Turns out it was the "fix" that exposed bad GC info on the instruction right after CORINFO_HELP_FAIL_FAST. It was always possible to have it incorrect, but as long as we treated the first instruction of epilog as uninterruptible, it was mitigating the issue.

Treating the first instruction of epilog as uninterruptible is not the right thing to do, in general, especially after calls that could return or could do GC.
CORINFO_HELP_FAIL_FAST is fairly special in that sense - it does not return and does not do GC, but also the way we emit this call very late for the GS cookie check often results in corrupted GC info on the next instruction.

The part that CORINFO_HELP_FAIL_FAST is a NoGC helper saves us in partially interruptible code since emitting the call does not result in a GC safe point. In Fully interruptible code, the bad GC info is observable (although only on a single instruction), thus we only see problems in rare GC-JIT stress combinations.

I have a few solutions in mind, but need to figure what actually works more naturally.
Re: #105596 (comment)

jakobbotsch added blocking-clean-ci-optional Blocking optional rolling runs Known Build Error Use this to report build issues in the .NET Helix tab labels May 17, 2024

dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label May 17, 2024

dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label May 17, 2024

jkotas added area-VM-coreclr and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels May 17, 2024

VSadov self-assigned this Jun 15, 2024

jakobbotsch mentioned this issue Jun 17, 2024

JIT: Avoid check for returns in methods with pinvokes in LSRA #103559

Merged

mangod9 removed the untriaged New issue has not been triaged by the area owner label Jul 3, 2024

mangod9 added this to the 9.0.0 milestone Jul 3, 2024

jakobbotsch mentioned this issue Jul 21, 2024

JIT: Have lowering set up IR for post-indexed addressing and make strength reduced IV updates amenable to post-indexed addressing #105185

Merged

jakobbotsch added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed area-VM-coreclr labels Jul 26, 2024

jakobbotsch assigned jakobbotsch and unassigned VSadov Jul 26, 2024

This was referenced Jul 26, 2024

Test failure: NullReferenceException in Internal.JitInterface.CorInfoImpl._beginInlining #105441

Closed

Occasionally hitting error MSB6006: "csc.dll" exited with code 139 on linux (w/ GCDynamicAdaptationMode=1) #104123

Closed

jakobbotsch mentioned this issue Jul 26, 2024

JIT: Fix placement of GT_START_NOGC for tailcalls in face of bulk copy with write barrier calls #105551

Merged

AndyAyersMS closed this as completed in #105551 Jul 26, 2024

AndyAyersMS closed this as completed in 99c9f5b Jul 26, 2024

hoyosjs mentioned this issue Jul 26, 2024

[release/9.0-preview7] JIT: Fix placement of GT_START_NOGC for tailcalls in face of bulk copy with write barrier calls #105572

Merged

2 tasks

VSadov mentioned this issue Jul 27, 2024

Partial re-revert of #104336. Only JIT fixes are included. #105596

Merged

github-actions bot locked and limited conversation to collaborators Aug 30, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Assert failure: executionAborted in `GcInfoDecoder::EnumerateLiveSlots` #102370

Assert failure: executionAborted in `GcInfoDecoder::EnumerateLiveSlots` #102370

jakobbotsch commented May 17, 2024 •

edited by build-analysis bot

Loading

jakobbotsch commented May 17, 2024

dotnet-policy-service bot commented May 17, 2024

VSadov commented May 17, 2024 •

edited

Loading

jakobbotsch commented May 29, 2024

jakobbotsch commented May 30, 2024

davidwrighton commented Jun 5, 2024

VSadov commented Jun 5, 2024 •

edited

Loading

VSadov commented Jun 5, 2024

jakobbotsch commented Jul 26, 2024

VSadov commented Jul 26, 2024

AndyAyersMS commented Jul 26, 2024

VSadov commented Jul 26, 2024

VSadov commented Jul 29, 2024

VSadov commented Jul 30, 2024 •

edited

Loading

Assert failure: executionAborted in GcInfoDecoder::EnumerateLiveSlots #102370

Assert failure: executionAborted in GcInfoDecoder::EnumerateLiveSlots #102370

Comments

jakobbotsch commented May 17, 2024 • edited by build-analysis bot Loading

Build Information

Error Message

Known issue validation

Report

Summary

jakobbotsch commented May 17, 2024

dotnet-policy-service bot commented May 17, 2024

VSadov commented May 17, 2024 • edited Loading

jakobbotsch commented May 29, 2024

jakobbotsch commented May 30, 2024

davidwrighton commented Jun 5, 2024

VSadov commented Jun 5, 2024 • edited Loading

VSadov commented Jun 5, 2024

jakobbotsch commented Jul 26, 2024

VSadov commented Jul 26, 2024

AndyAyersMS commented Jul 26, 2024

VSadov commented Jul 26, 2024

VSadov commented Jul 29, 2024

VSadov commented Jul 30, 2024 • edited Loading

Assert failure: executionAborted in `GcInfoDecoder::EnumerateLiveSlots` #102370

Assert failure: executionAborted in `GcInfoDecoder::EnumerateLiveSlots` #102370

jakobbotsch commented May 17, 2024 •

edited by build-analysis bot

Loading

VSadov commented May 17, 2024 •

edited

Loading

VSadov commented Jun 5, 2024 •

edited

Loading

VSadov commented Jul 30, 2024 •

edited

Loading