Test failure: JIT\Regression\JitBlue\GitHub_27924\GitHub_27924\GitHub_27924.cmd #36681
@BruceForstall any update on this? This failed again on: https://helix.dot.net/api/2019-06-17/jobs/eb11af78-3bd5-4fce-ad89-49b91832128d/workitems/JIT.Regression/console
@safern I haven't had a chance to look at this yet. Hopefully this week.
failed again in job: runtime-coreclr jitstress 20200610.1
failed test: JIT\Regression\JitBlue\GitHub_27924\GitHub_27924\GitHub_27924.cmd
Error message

Failures are in
(Note: this is a Windows arm issue.) Even though this configuration sets:
the test also sets
(It's the only test in the coreclr test tree to do so, and it's a Pri-0 test, so it probably gets far more GCStress runs than any other test.) I couldn't repro it initially, but after tweaking some of the loops I could repro roughly once every 10 or so runs, and caught it in the debugger. The assert is in GCStress code in
There are many threads. The thread being walked that causes the assert is:
in this case, the first "real" frame is
So, we're actually AT the last call instruction, in a non-interruptible region, not after the call instruction. How do we get here? Is some kind of initial stack walk setup getting confused on this call? What's also interesting is that while we are running with [edit: it looks like when concurrent GC is "forced" (I can't tell why it would be considered "forced" in this case),
Other information at this point:
@janvorli @AndyAyersMS Any suggestions for what to look at next?
There's something strange about the EE suspension. Through multiple GCs, there are 3 threads waiting for
Two of these continue and return to the caller, a safe point, before suspension ends. The third, TID 28d0, does not, so we end up suspended but not at a safe point.
I can see that GC stress mode C injects the special invalid instructions at call sites too, but it should be detected as a special case (the instruction is different for call sites). Can you see any local variable values at the following frame? I would like to see if it was correctly detected as a call site (
I was wondering whether, by some mistake, GCCoverageInfo::SprinkleBreakpoints puts in an incorrect invalid instruction, one that would indicate a safe point instead of a call. Looking at the code in replaceSafePointInstructionWithGcStressInstr, which gets called from GCCoverageInfo::SprinkleBreakpoints, it seems it should work correctly for our call instruction (instr code 0x4798): runtime/src/coreclr/src/vm/gccover.cpp Lines 814 to 828 in 2799404
There does, however, seem to be an actual bug in this code: when it detects a call through an immediate, it sets instructionIsACallThroughRegister instead of instructionIsACallThroughImmediate. But that should not cause the issue we are seeing.
Unfortunately, I lost the above repro and didn't create a dump. I tried repro'ing again with more verbose stress log settings, but it didn't seem to repro, so I kept dialing back the settings, and after many hours I got another repro. Similar, but not identical: this time two threads start at the same location, and one gets to a post-call safe point but the other doesn't. I saved a minidump this time. However, (almost) no locals in
It sounds like one of your theories is that multiple threads are on the same gcstress-injected illegal call instruction, and one doesn't get replaced properly when resuming or replacing that instruction? I was wondering if there was possibly a non-gcstress-related GC suspension issue, but it would make more sense if the issue was gcstress-only.
What I really meant is that if the call stack is not misleading and DoGcStress was really called for a PC at the call, then it seems that the invalid instruction that was put at the call was not INTERRUPT_INSTR_CALL or INTERRUPT_INSTR_CALL_32, but INTERRUPT_INSTR or INTERRUPT_INSTR_32. Otherwise DoGcStress would detect that it was the call-related one, set atCall=true, and thus would never reach the GCHeap::StressHeap call.
failed again in job: runtime-coreclr jitstress 20200629.1
failed test: JIT\Regression\JitBlue\GitHub_27924\GitHub_27924\GitHub_27924.cmd
Error message

@dotnet/jit-contrib we need somebody to look at this.
It looks to me like
Seems like we should re-enable this in a draft PR and see if we can catch some more failure instances? Or, I can try and repro this locally, though it might be a few days before I can get around to it.
I can relatively easily reproduce this locally (looking now).
Seeing if I can repro on Linux arm32.
No luck after several thousand iterations. @BruceForstall you mentioned altering the test to make the failures repro more readily?
I simply upped the loops, from
I'm also running with:
which might all be required.
I found a race condition in the GCStress infrastructure which I believe causes this problem. The details are related to what Jan suggested above about a problem with the instruction planted at the call site.

Consider the case of a partially interruptible function where we've replaced each call site with `INTERRUPT_INSTR_CALL` or `INTERRUPT_INSTR_CALL_32`.

Multiple threads can enter `DoGcStress` at the same time.

It seems like this should be an all-platform issue, so I'm not sure why we only see it on Windows arm32, except that it's very rare, the test that (rarely) exposes it is run with GCStress much more frequently on Windows than on Linux, and arm32 is probably a lot slower, which potentially widens the race window. It seems like the solution should be to simply move the `if (!IsGcCoverageInterruptInstruction(instrPtr))` check below the read of `*instrPtr` that sets `atCall`.
I couldn't repro with upped loop counts and also with some sideloaded stress.
Note that for arm32 we make multiple reads of instrPtr, so perhaps the race window is much larger? It also means the fix is perhaps a bit more complex; we need to be careful to make just one read, then (perhaps?) use interlocked ops to try to update, so there's just one "winner".
The reason we read multiple times is to determine if we've got a 2-byte or a 4-byte instruction encoding, using |
With my fix of moving the
I suppose we won't ever see torn reads so you are probably right about re-reads being fairly harmless.
That's great news. Thanks for digging into this one.
There definitely are assumptions in the GCStress code about not seeing torn reads, and that any
Consider the case of a partially interruptible function where we've replaced each call site with INTERRUPT_INSTR_CALL or INTERRUPT_INSTR_CALL_32. Even though the call site itself isn't a safe point, we put a GCStress instruction there so we can capture the call target at GC stress time, then set a specific GCStress instruction for the next instruction that will indicate how to GC protect any GC refs that are returned from the call.

Multiple threads can enter `DoGcStress` at the same time. They all get just past the race check `if (!IsGcCoverageInterruptInstruction(instrPtr)) return;`. One goes ahead and reads the instruction code for the address at which we're doing a GCStress. We set `atCall` to `true` if this matches one of our known GCStress instructions for partially-interruptible call sites. Later, if `atCall` is `true`, we write the original call instruction back to the code stream, write the distinguished next-instruction GCStress instruction, and continue. If the other threads then read at the instruction address, they might read the actual call opcode, not the distinguished GCStress breakpoint instruction code. Then, they will set `atCall` to `false`. This will indicate that we are in a fully-interruptible location, or that no special call-site behavior is required. We go on, force a GC, and eventually get to the assert in `EECodeManager::EnumGcRefs` that we are at a GC safe point.

The solution is to simply move the `if (!IsGcCoverageInterruptInstruction(instrPtr))` check below the read of `*instrPtr` to set `atCall`.

Re-enable the GitHub_27924 test.

Fixes dotnet#36681
Since this is GCStress only, moving to 6.0. |
failed in job: runtime-coreclr jitstress 20200517.2
Error message
category:correctness
theme:gc-stress
skill-level:expert
cost:large