Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix DNS cancellation deadlock #63904

Merged

Conversation

antonfirsov
Copy link
Member

As shown by #63552 (comment), in case the completion triggers before cancellation, GetAddrInfoExCancel does not return while the completion routine is running. This results in a deadlock if we are taking the same lock in completion and cancellation callbacks.

It looks to me that the lock is unnecessary, since GetAddrInfoExContext* is the only shared state, and pointer access is atomic.

As a prerequisite, I had to address #43816, so we can have reliable cancellation tests. I did this by serializing the tests, which eliminates parallel load as an instability factor causing races between CPU and IO-bound code.

Fixes #63552.
Resolves #43816. (If not, we can reopen.)

@ghost
Copy link

ghost commented Jan 17, 2022

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

Issue Details

As shown by #63552 (comment), in case the completion triggers before cancellation, GetAddrInfoExCancel does not return while the completion routine is running. This results in a deadlock if we are taking the same lock in completion and cancellation callbacks.

It looks to me that the lock is unnecessary, since GetAddrInfoExContext* is the only shared state, and pointer access is atomic.

As a prerequisite, I had to address #43816, so we can have reliable cancellation tests. I did this by serializing the tests, which eliminates parallel load as an instability factor causing races between CPU and IO-bound code.

Fixes #63552.
Resolves #43816. (If not, we can reopen.)

Author: antonfirsov
Assignees: -
Labels:

area-System.Net

Milestone: -

@antonfirsov

This comment has been minimized.

@antonfirsov antonfirsov requested a review from a team January 17, 2022 22:42
@azure-pipelines

This comment has been minimized.

Copy link
Member

@ManickaP ManickaP left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks.

@antonfirsov

This comment has been minimized.

@azure-pipelines

This comment has been minimized.

@antonfirsov
Copy link
Member Author

/azp run runtime-libraries-coreclr outerloop

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@@ -409,17 +411,14 @@ public void RegisterForCancellation(CancellationToken cancellationToken)
var @this = (GetAddrInfoExState)o!;
int cancelResult = 0;

lock (@this)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a use-after-free race condition here with context being freed at the end of ProcessResult?

Copy link
Member Author

@antonfirsov antonfirsov Jan 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, I overlooked this race. Looks like we should implement a more intrusive change adding some sort of ref-counting.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@karelz karelz added this to the 7.0.0 milestone Jan 25, 2022
@antonfirsov
Copy link
Member Author

antonfirsov commented Jan 31, 2022

@scalablecory @geoffkizer any idea why do we prefer unmanaged AllocHGlobal allocation instead of embedding GetAddrInfoExContext into GetAddrInfoExState and pinning the managed object?

var context = (GetAddrInfoExContext*)Marshal.AllocHGlobal(sizeof(GetAddrInfoExContext));

We allocate a GCHandle for GetAddrInfoExState anyways:

public IntPtr CreateHandle() => GCHandle.ToIntPtr(GCHandle.Alloc(this, GCHandleType.Normal));

It seems to me that the straightforward way to address #63904 (comment) is to introduce some reference counting. If it's reasonable to embed GetAddrInfoExContext into GetAddrInfoExState, the SafeHandle model doesn't fit this case. Is there an established way to implement custom ref counting in libraries? I see DeferredDisposableLifetime, but it's an internal of System.Private.CoreLib.

@geoffkizer
Copy link
Contributor

We allocate a GCHandle for GetAddrInfoExState anyways:

Yeah, but the GCHandle here is not pinned. Maybe it could be, but I'm not sure of the implications of that.

@antonfirsov
Copy link
Member Author

antonfirsov commented Jan 31, 2022

I realized that the GC cannot pin GetAddrInfoExState since it contains managed references. Addressed the use-after-free race condition by turning GetAddrInfoExState int a SafeHandle guarding GetAddrInfoExContext*. (a2a9355)

@scalablecory
Copy link
Contributor

@scalablecory @geoffkizer any idea why do we prefer unmanaged AllocHGlobal allocation instead of embedding GetAddrInfoExContext into GetAddrInfoExState and pinning the managed object?

Not sure why we use that instead of just pinning an array from the array pool. Hard to know how perf would change.

@jkotas
Copy link
Member

jkotas commented Feb 1, 2022

Not sure why we use that instead of just pinning an array from the array pool.

If you need a piece of pinned unmanaged memory, it is easier to allocate it using AllocHGlobal. The performance difference is negligible, also depends on what you measure. Pinning has negative impact on the overall GC performance that is hard to measure using microbenchmarks.

You can switch from AllocHGlobal to NativeMemory.Alloc/Free while you are on it. NativeMemory.Alloc/Free looks better and it is also a bit faster than AllocHGlobal.

@karelz
Copy link
Member

karelz commented Feb 8, 2022

@ManickaP can you please re-review latest changes? Thanks!

@@ -194,10 +194,10 @@ private static unsafe void GetAddressInfoExCallback(int error, int bytes, Native

private static unsafe void ProcessResult(SocketError errorCode, GetAddrInfoExContext* context)
{
GetAddrInfoExState state = GetAddrInfoExState.FromHandleAndFree(context->QueryStateHandle);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why it is not inside try anymore?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because state is being disposed now in the finally block, so it has to live out the try scope. FromHandleAndFree should only throw if there is a bug in our code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The state can be just declared outside the try-finally and conditionally disposed, the same way as it's in GetAddrInfoAsync.
But I don't mind either way. If this throws, we have a bigger problem than missing Dispose call.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That won't help unfortunately. If the FromHandleAndFree throws, no value will be assigned to state, so there will be nothing to dispose.

Copy link
Member

@ManickaP ManickaP left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@@ -194,10 +194,10 @@ private static unsafe void GetAddressInfoExCallback(int error, int bytes, Native

private static unsafe void ProcessResult(SocketError errorCode, GetAddrInfoExContext* context)
{
GetAddrInfoExState state = GetAddrInfoExState.FromHandleAndFree(context->QueryStateHandle);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The state can be just declared outside the try-finally and conditionally disposed, the same way as it's in GetAddrInfoAsync.
But I don't mind either way. If this throws, we have a bigger problem than missing Dispose call.

@antonfirsov

This comment was marked as resolved.

@antonfirsov

This comment was marked as resolved.

@azure-pipelines

This comment was marked as resolved.

1 similar comment
@azure-pipelines

This comment was marked as duplicate.

@antonfirsov
Copy link
Member Author

OuterLoop failures are #65648.

@dotnet dotnet deleted a comment from azure-pipelines bot Feb 22, 2022
@ManickaP
Copy link
Member

ManickaP commented Mar 8, 2022

/azp run runtime-libraries-coreclr outerloop

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@antonfirsov
Copy link
Member Author

The re-enabled test, DnsGetHostEntry_PostCancelledToken_Throws failed on Windows 7, which might be the result of flakiness experienced the weird win7 test environment. We may just disable the test on that OS.

Rerunning for now.

@antonfirsov
Copy link
Member Author

/azp run runtime-libraries-coreclr outerloop

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).


// This is a regression test for https://github.com/dotnet/runtime/issues/63552
[Fact]
[ActiveIssue("https://github.com/dotnet/runtime/issues/33378", TestPlatforms.AnyUnix)] // Cancellation of an outstanding getaddrinfo is not supported on *nix.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As described in #63902 we try to reduce the amount of workarounds and test-ignores if the regarding issue is closed.
#33378 is closed. So imo this can be removed

Suggested change
[ActiveIssue("https://github.com/dotnet/runtime/issues/33378", TestPlatforms.AnyUnix)] // Cancellation of an outstanding getaddrinfo is not supported on *nix.

Copy link
Member Author

@antonfirsov antonfirsov Mar 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR #34633 that closed that issue did not add cancellation support and it has been reverted in #48666 in the end.

As a result [ActiveIssue] points to un outdated and closed issue and we are not tracking this problem properly today. This is something we should clean up, but I would prefer to address this later, and not to block this PR even longer.

@antonfirsov

This comment was marked as resolved.

@azure-pipelines

This comment was marked as resolved.

1 similar comment
@azure-pipelines

This comment was marked as resolved.

@antonfirsov
Copy link
Member Author

Test failures are unrelated, mostly #66803.

@antonfirsov antonfirsov merged commit 548b70d into dotnet:main Mar 21, 2022
@MartyIX
Copy link
Contributor

MartyIX commented Mar 22, 2022

Thank you for the bug fix. Happy to re-test once it's released. 🙏

@rzikm
Copy link
Member

rzikm commented Mar 29, 2022

/backport to release/6.0

@github-actions
Copy link
Contributor

Started backporting to release/6.0: https://github.com/dotnet/runtime/actions/runs/2058994928

radekdoulik pushed a commit to radekdoulik/runtime that referenced this pull request Mar 30, 2022
Avoid taking a lock, and address the use-after-free race condition by guarding GetAddrInfoExContext with a SafeHandle.
@ghost ghost locked as resolved and limited conversation to collaborators Apr 28, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Dns.GetHostAddressesAsync gets stuck on Windows Move name resolution cancellation test into Docker
10 participants