Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Preserve last error for patchpoint helpers #75922

Merged
merged 3 commits into from
Sep 22, 2022

Conversation

AndyAyersMS
Copy link
Member

In stress modes (and eventually perhaps in normal uses) the jit may insert patchpoint helper calls in regions where last error is live. So the helpers need to preserve last error.

Because some invocations of the helpers may transition to OSR methods instead of returning, we can't use the normal macros for this.

Fixes #75828.

In stress modes (and eventually perhaps in normal uses) the jit may insert
patchpoint helper calls in regions where last error is live. So the helpers
need to preserve last error.

Because some invocations of the helpers may transition to OSR methods instead
of returning, we can't use the normal macros for this.

Fixes dotnet#75828.
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Sep 20, 2022
@ghost ghost assigned AndyAyersMS Sep 20, 2022
@ghost
Copy link

ghost commented Sep 20, 2022

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

In stress modes (and eventually perhaps in normal uses) the jit may insert patchpoint helper calls in regions where last error is live. So the helpers need to preserve last error.

Because some invocations of the helpers may transition to OSR methods instead of returning, we can't use the normal macros for this.

Fixes #75828.

Author: AndyAyersMS
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: -

@AndyAyersMS
Copy link
Member Author

@jakobbotsch PTAL
cc @dotnet/jit-contrib @jkotas

Seems like EnC could hit this issue too?

@@ -4916,6 +4916,9 @@ HCIMPLEND

void JIT_Patchpoint(int* counter, int ilOffset)
{
// BEGIN_PRESERVE_LAST_ERROR;
DWORD __dwLastError = ::GetLastError();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
DWORD __dwLastError = ::GetLastError();
DWORD dwLastError = ::GetLastError();

@@ -5205,6 +5220,9 @@ void JIT_Patchpoint(int* counter, int ilOffset)
//
void JIT_PartialCompilationPatchpoint(int ilOffset)
{
// BEGIN_PRESERVE_LAST_ERROR;
DWORD __dwLastError = ::GetLastError();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
DWORD __dwLastError = ::GetLastError();
DWORD dwLastError = ::GetLastError();

@jkotas
Copy link
Member

jkotas commented Sep 20, 2022

Seems like EnC could hit this issue too?

I do not see any obvious problem. EnC patch runs inside debugger step handler that is preserving last error:

DWORD dwLastError = GetLastError();

@AndyAyersMS
Copy link
Member Author

OSX seems to have some networking issues:

fatal: unable to access 'https://github.com/dotnet/runtime/': The requested URL returned error: 429
##[warning]Git fetch failed with exit code 128, back off 8.881 seconds before retry.

@stephentoub
Copy link
Member

Yeah, almost every PR today has failed most mac legs because of that. @MattGal is investigating.

@MattGal
Copy link
Member

MattGal commented Sep 20, 2022

Yeah, almost every PR today has failed most mac legs because of that. @MattGal is investigating.

The "normal" channels are being a bit slow but I have confirmed they're aware of the throttling problem and are trying to identify the cause presently.

@jkotas
Copy link
Member

jkotas commented Sep 21, 2022

Is this .NET 7 backport candidate? (Intermittent data corruption caused by OSR.)

@AndyAyersMS
Copy link
Member Author

Is this .NET 7 backport candidate? (Intermittent data corruption caused by OSR.)

I don't know if we can hit this with normal OSR.

The failure above is from a stress mode that places patchpoints in randomly selected blocks where the stack is empty, whereas OSR normally will only place patchpoints at blocks that are the start or end of a loop.

I'll spend some time today trying to work up a case; seems like the "end of loop" patchpoint (which is the default placement) might be a place we could hit this normally.

@jkotas
Copy link
Member

jkotas commented Sep 21, 2022

I would expect that something like this can hit this reliably:

int test()
{
    Marshal.SetLastSystemError(42);

    int fact = 1;
    for (int i = 0; i < 2_000_000_000; i++)
    {
       fact *= i;
    }

    return fact + Marshal.GetLastSystemError();
}

@AndyAyersMS
Copy link
Member Author

I would expect that something like this can hit this reliably:

Yes, thanks.

Repro is simple and deterministic; the failure happens in Tier0 code once it has invoked the patchpoint helper. Will add a test case.

@AndyAyersMS
Copy link
Member Author

AndyAyersMS commented Sep 21, 2022

For Linxu arm64 NAOT: msbuild crashed during restore:

/__w/1/s/.dotnet/sdk/7.0.100-preview.7.22377.5/MSBuild.dll /nologo -maxcpucount /m -verbosity:m /v:minimal /bl:/__w/1/s/artifacts/log/Release/ToolsetRestore.binlog /clp:Summary /clp:ErrorsOnly;NoSummary /nr:false /p:TreatWarningsAsErrors=true /p:ContinuousIntegrationBuild=true /p:__ToolsetLocationOutputFile=/__w/1/s/artifacts/toolset/8.0.0-beta.22469.1.txt /t:__WriteToolsetLocation /warnaserror /__w/1/s/artifacts/toolset/restore.proj
/__w/1/s/eng/common/tools.sh: line 474:   535 Segmentation fault      (core dumped) "$_InitializeBuildTool" "$@"
Build failed with exit code 139. Check errors above.
Finishing: Build product

Similar perhaps to #43826 but that issue is quite old.

@AndyAyersMS AndyAyersMS merged commit c81e7a2 into dotnet:main Sep 22, 2022
@AndyAyersMS
Copy link
Member Author

@jeffschwMSFT @JulieLeeMSFT we should consider porting this to 7.0.

@jeffschwMSFT
Copy link
Member

@AndyAyersMS in what cases (beyond stress modes) do we anticipate that this may occur?

@AndyAyersMS
Copy link
Member Author

@AndyAyersMS in what cases (beyond stress modes) do we anticipate that this may occur?

This PR has a simple test case that fails without stress -- the value of last error will not be preserved across loops.

@jkotas
Copy link
Member

jkotas commented Sep 22, 2022

In a more real-world situation, the long-running for-loop in the test can be some kind of interop marshalling code.

@AndyAyersMS
Copy link
Member Author

/backport to release/7.0

@github-actions
Copy link
Contributor

Started backporting to release/7.0: https://github.com/dotnet/runtime/actions/runs/3130148038

@ghost ghost locked as resolved and limited conversation to collaborators Oct 26, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[libraries-pgo] Issues related to overlapped IO
6 participants