Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Assert failure: CheckInstrBytePattern(epilogBase[offset] & X86_INSTR_RET, X86_INSTR_RET, epilogBase[offset]) || CheckInstrBytePattern(epilogBase[offset], X86_INSTR_JMP_NEAR_REL32, epilogBase[offset]) || CheckInstrWord(*PTR_WORD(epilogBase + offset), X86_INSTR_w_JMP_FAR_IND_IMM) #68431

Closed
BruceForstall opened this issue Apr 23, 2022 · 5 comments · Fixed by #72592

Comments

@BruceForstall
Copy link
Member

I see a failure in baseservices\threading\generics\Monitor\EnterExit12\EnterExit12.dll if run under GCStress=C on x86 in a loop, within 10 or so iterations:

c:\gh\runtime2\artifacts\tests\coreclr\windows.x86.Checked\Tests\Core_Root\corerun.exe C:\gh\runtime2\artifacts\tests\coreclr\windows.x86.Checked\baseservices\threading\generics\Monitor\EnterExit12\EnterExit12.dll

Assert failure(PID 90252 [0x0001608c], Thread: 99344 [0x18410]): CheckInstrBytePattern(epilogBase[offset] & X86_INSTR_RET, X86_INSTR_RET, epilogBase[offset]) || CheckInstrBytePattern(epilogBase[offset], X86_INSTR_JMP_NEAR_REL32, epilogBase[offset]) || CheckInstrWord(*PTR_WORD(epilogBase + offset), X86_INSTR_w_JMP_FAR_IND_IMM)

CORECLR! UnwindEspFrameEpilog + 0x1DC (0x5a8eab4c)
CORECLR! UnwindStackFrame + 0x230 (0x5a8eaf50)
CORECLR! DoGcStress + 0x263 (0x5aad21fb)
CORECLR! OnGcCoverageInterrupt + 0x160 (0x5aad284e)
CORECLR! IsGcMarker + 0x93 (0x5a8ff26f)
CORECLR! CLRVectoredExceptionHandlerShim + 0xC5 (0x5a8f46f5)
NTDLL! RtlIpv4AddressToStringA + 0x132 (0x778ae5a2)
NTDLL! RtlUnwind + 0x1D1 (0x778a9261)
NTDLL! KiUserExceptionDispatcher + 0x26 (0x778b7026)
SYSTEM.PRIVATE.CORELIB! <no symbol> + 0x0 (0x59c0de2c)
    File: C:\gh\runtime2\src\coreclr\vm\eetwain.cpp Line: 3531
    Image: c:\gh\runtime2\artifacts\tests\coreclr\windows.x86.Checked\Tests\Core_Root\corerun.exe

with script:

set COMPlus_TieredCompilation=0
set COMPlus_GCStress=C

set _count=0
:retry_loop
echo ========= %_count%
c:\gh\runtime2\artifacts\tests\coreclr\windows.x86.Checked\Tests\Core_Root\corerun.exe C:\gh\runtime2\artifacts\tests\coreclr\windows.x86.Checked\baseservices\threading\generics\Monitor\EnterExit12\EnterExit12.dll
if not %errorlevel%==100 goto :eof
@set /A _count=_count + 1
@goto retry_loop

I see this show up in the CI, but not frequently. E.g.,

https://dev.azure.com/dnceng/public/_build/results?buildId=1683920&view=ms.vss-test-web.build-test-results-tab&runId=46090658&resultId=101234&paneView=debug

This assert was hit before: #12953

In the debugger, it looks like we're trying to unwind from the first instruction of the prolog, which is a gcstress hlt instruction:

C:\gh\runtime\src\tests\baseservices\threading\generics\Monitor\EnterExit12.cs @ 46:
08743080 50              push    eax
08743081 890c24          mov     dword ptr [esp],ecx
08743084 8b490c          mov     ecx,dword ptr [ecx+0Ch]
08743087 8b1424          mov     edx,dword ptr [esp]
0874308a 8b5210          mov     edx,dword ptr [edx+10h]
0874308d 3909            cmp     dword ptr [ecx],ecx
0874308f ff156070b508    call    dword ptr ds:[8B57060h] (TestHelper.Consumer(System.Object), mdToken: 06000014)

C:\gh\runtime\src\tests\baseservices\threading\generics\Monitor\EnterExit12.cs @ 47:
>>> 08743095 59              pop     ecx (gcstress)
08743096 c3              ret

0:018> u 0x8743080 0x8743097
08743080 50              push    eax
08743081 890c24          mov     dword ptr [esp],ecx
08743084 8b490c          mov     ecx,dword ptr [ecx+0Ch]
08743087 8b1424          mov     edx,dword ptr [esp]
0874308a 8b5210          mov     edx,dword ptr [edx+10h]
0874308d 3909            cmp     dword ptr [ecx],ecx
0874308f ff156070b508    call    dword ptr ds:[8B57060h]
08743095 f4              hlt
08743096 c3              ret

so maybe the unwind code doesn't expect to see hlt?

@AndyAyersMS @janvorli

@BruceForstall BruceForstall added this to the 7.0.0 milestone Apr 23, 2022
@dotnet-issue-labeler dotnet-issue-labeler bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI untriaged New issue has not been triaged by the area owner labels Apr 23, 2022
@ghost
Copy link

ghost commented Apr 23, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

I see a failure in baseservices\threading\generics\Monitor\EnterExit12\EnterExit12.dll if run under GCStress=C on x86 in a loop, within 10 or so iterations:

c:\gh\runtime2\artifacts\tests\coreclr\windows.x86.Checked\Tests\Core_Root\corerun.exe C:\gh\runtime2\artifacts\tests\coreclr\windows.x86.Checked\baseservices\threading\generics\Monitor\EnterExit12\EnterExit12.dll

Assert failure(PID 90252 [0x0001608c], Thread: 99344 [0x18410]): CheckInstrBytePattern(epilogBase[offset] & X86_INSTR_RET, X86_INSTR_RET, epilogBase[offset]) || CheckInstrBytePattern(epilogBase[offset], X86_INSTR_JMP_NEAR_REL32, epilogBase[offset]) || CheckInstrWord(*PTR_WORD(epilogBase + offset), X86_INSTR_w_JMP_FAR_IND_IMM)

CORECLR! UnwindEspFrameEpilog + 0x1DC (0x5a8eab4c)
CORECLR! UnwindStackFrame + 0x230 (0x5a8eaf50)
CORECLR! DoGcStress + 0x263 (0x5aad21fb)
CORECLR! OnGcCoverageInterrupt + 0x160 (0x5aad284e)
CORECLR! IsGcMarker + 0x93 (0x5a8ff26f)
CORECLR! CLRVectoredExceptionHandlerShim + 0xC5 (0x5a8f46f5)
NTDLL! RtlIpv4AddressToStringA + 0x132 (0x778ae5a2)
NTDLL! RtlUnwind + 0x1D1 (0x778a9261)
NTDLL! KiUserExceptionDispatcher + 0x26 (0x778b7026)
SYSTEM.PRIVATE.CORELIB! <no symbol> + 0x0 (0x59c0de2c)
    File: C:\gh\runtime2\src\coreclr\vm\eetwain.cpp Line: 3531
    Image: c:\gh\runtime2\artifacts\tests\coreclr\windows.x86.Checked\Tests\Core_Root\corerun.exe

with script:

set COMPlus_TieredCompilation=0
set COMPlus_GCStress=C

set _count=0
:retry_loop
echo ========= %_count%
c:\gh\runtime2\artifacts\tests\coreclr\windows.x86.Checked\Tests\Core_Root\corerun.exe C:\gh\runtime2\artifacts\tests\coreclr\windows.x86.Checked\baseservices\threading\generics\Monitor\EnterExit12\EnterExit12.dll
if not %errorlevel%==100 goto :eof
@set /A _count=_count + 1
@goto retry_loop

I see this show up in the CI, but not frequently. E.g.,

https://dev.azure.com/dnceng/public/_build/results?buildId=1683920&view=ms.vss-test-web.build-test-results-tab&runId=46090658&resultId=101234&paneView=debug

This assert was hit before: #12953

In the debugger, it looks like we're trying to unwind from the first instruction of the prolog, which is a gcstress hlt instruction:

C:\gh\runtime\src\tests\baseservices\threading\generics\Monitor\EnterExit12.cs @ 46:
08743080 50              push    eax
08743081 890c24          mov     dword ptr [esp],ecx
08743084 8b490c          mov     ecx,dword ptr [ecx+0Ch]
08743087 8b1424          mov     edx,dword ptr [esp]
0874308a 8b5210          mov     edx,dword ptr [edx+10h]
0874308d 3909            cmp     dword ptr [ecx],ecx
0874308f ff156070b508    call    dword ptr ds:[8B57060h] (TestHelper.Consumer(System.Object), mdToken: 06000014)

C:\gh\runtime\src\tests\baseservices\threading\generics\Monitor\EnterExit12.cs @ 47:
>>> 08743095 59              pop     ecx (gcstress)
08743096 c3              ret

0:018> u 0x8743080 0x8743097
08743080 50              push    eax
08743081 890c24          mov     dword ptr [esp],ecx
08743084 8b490c          mov     ecx,dword ptr [ecx+0Ch]
08743087 8b1424          mov     edx,dword ptr [esp]
0874308a 8b5210          mov     edx,dword ptr [edx+10h]
0874308d 3909            cmp     dword ptr [ecx],ecx
0874308f ff156070b508    call    dword ptr ds:[8B57060h]
08743095 f4              hlt
08743096 c3              ret

so maybe the unwind code doesn't expect to see hlt?

@AndyAyersMS @janvorli

Author: BruceForstall
Assignees: -
Labels:

arch-x86, os-windows, GCStress, area-CodeGen-coreclr, untriaged

Milestone: 7.0.0

@BruceForstall BruceForstall removed the untriaged New issue has not been triaged by the area owner label Apr 23, 2022
@AndyAyersMS
Copy link
Member

so maybe the unwind code doesn't expect to see hlt?

Seems likely.

I don't see how we could get this far into UnwindEspFrameEpilog without hitting some other assert first., but I'm probably overlooking something.

@janvorli
Copy link
Member

From the dump I got from the same issue in #70968, I can see that there are two threads running GC stress in the same function. One invoked the vectored exception handler on the pop eax and the other (the one that crashed) on the ret.
We don't seem to be protecting against a race in instrumenting the same code by multiple threads and unwinding of the epilog.
@AndyAyersMS do you remember how this is supposed to work?
I guess this issue may be the case described by the comment here:

// We are not at a GC safe point so we can't Suspend EE (Suspend EE will yield to GC).
// But we still have to update the GC Stress instruction. We do it directly without suspending
// other threads, which means a race on updating is still possible. But for X86 the window of
// race is so small that we could ignore it. We need a better solution if the race becomes a real problem.
// see details about <GCStress instruction update race> in comments above

@jakobbotsch
Copy link
Member

@janvorli do you know why the race is only a problem on x86?

@jakobbotsch jakobbotsch added area-VM-coreclr and removed area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI labels Jul 21, 2022
@jakobbotsch
Copy link
Member

The race here seems to be one that is hinted at here:

// @Todo: possible race here, might need to be fixed if it become a problem.
// It could become a problem if 64bit does partially interrupt work.
// OK, we have the MD, mark the instruction after the CALL
// appropriately
ReplaceInstrAfterCall(nextInstrWriterHolder.GetRW(), targetMD);

As @jkotas has explained to me, for partially interruptible methods there is a mismatch between where GC stress runs GCs and where normal return address hijacking would run GCs. GC stress runs the GC on the first instruction after a call returns, where there for partially interruptible methods is no GC info that says to protect GC pointers in return registers. It means GC stress needs to do some special work to figure out that these need to be protected.

The way it does that is the following:

  • For direct call sites, in GCCoverageInfo::SprinkleBreakpoints it gets the target MD of each call site and places a special illegal instruction right after the call that the GC stress handler will use to figure out which (if any) registers have GC pointers that need protection
  • For indirect call sites, in GCCoverageInfo::SprinkleBreakpoint it will first place an illegal instruction on the call instruction so that the GC stress handler will break there. Once the GC stress handler breaks, it computes the target address and gets the target MD from that, and then places the illegal instruction on the next instruction like above.

GCCoverageInfo::SprinkleBreakpoints runs right after jitting and should therefore not race with anything. However, the GC stress handler can race with any other thread accessing the same function. That makes the latter problematic on x86 where unwinding reads the epilog code to work. In this particular case thread A is about to unwind through the epilog when thread B stops on an indirect call right before the epilog. Thread B then overwrites the first instruction of the epilog, causing thread A to unwind incorrectly.

jakobbotsch added a commit to jakobbotsch/runtime that referenced this issue Jul 21, 2022
For partially interruptible methods there is a mismatch between where GC
stress runs GCs and where normal return address hijacking would run GCs.
GC stress runs the GC on the first instruction after a call returns,
where there for partially interruptible methods is no GC info that says
to protect GC pointers in return registers. It means GC stress needs to
do some special work to figure out that these need to be protected.

The way it does that is the following:

* For direct call sites, in GCCoverageInfo::SprinkleBreakpoints it gets
  the target MD of each call site and places a special illegal
  instruction right after the call that the GC stress handler will use
  to figure out which (if any) registers have GC pointers that need
  protection

* For indirect call sites, in GCCoverageInfo::SprinkleBreakpoint it will
  first place an illegal instruction on the call instruction so that the
  GC stress handler will break there. Once the GC stress handler breaks,
  it computes the target address and gets the target MD from that, and
  then places the illegal instruction on the next instruction like
  above.

GCCoverageInfo::SprinkleBreakpoints runs right after jitting and should
therefore not race with anything. However, the GC stress handler can
race with any other thread accessing the same function. That makes the
latter problematic on x86 where unwinding reads the epilog code to work.
In this particular case thread A is about to unwind through the epilog
when thread B stops on an indirect call right before the epilog. Thread
B then overwrites the first instruction of the epilog, causing thread A
to unwind incorrectly.

UnwindStackFrame already has access to the GC stress saved code and is
actually already using it for other unwinding. To fix the issue, make it
use the saved code for unwinding epilogs as well.

Fix dotnet#68431
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Jul 21, 2022
jakobbotsch added a commit that referenced this issue Jul 21, 2022
For partially interruptible methods there is a mismatch between where GC
stress runs GCs and where normal return address hijacking would run GCs.
GC stress runs the GC on the first instruction after a call returns,
where there for partially interruptible methods is no GC info that says
to protect GC pointers in return registers. It means GC stress needs to
do some special work to figure out that these need to be protected.

The way it does that is the following:

* For direct call sites, in GCCoverageInfo::SprinkleBreakpoints it gets
  the target MD of each call site and places a special illegal
  instruction right after the call that the GC stress handler will use
  to figure out which (if any) registers have GC pointers that need
  protection

* For indirect call sites, in GCCoverageInfo::SprinkleBreakpoint it will
  first place an illegal instruction on the call instruction so that the
  GC stress handler will break there. Once the GC stress handler breaks,
  it computes the target address and gets the target MD from that, and
  then places the illegal instruction on the next instruction like
  above.

GCCoverageInfo::SprinkleBreakpoints runs right after jitting and should
therefore not race with anything. However, the GC stress handler can
race with any other thread accessing the same function. That makes the
latter problematic on x86 where unwinding reads the epilog code to work.
In this particular case thread A is about to unwind through the epilog
when thread B stops on an indirect call right before the epilog. Thread
B then overwrites the first instruction of the epilog, causing thread A
to unwind incorrectly.

UnwindStackFrame already has access to the GC stress saved code and is
actually already using it for other unwinding. To fix the issue, make it
use the saved code for unwinding epilogs as well.

Fix #68431
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jul 21, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Aug 20, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants