Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

http.sys accept loop - mitigate against break due to possible conflicting IO callbacks #54368

Merged
merged 4 commits into from
Mar 11, 2024

Conversation

mgravell
Copy link
Member

@mgravell mgravell commented Mar 5, 2024

Context: #54251

System.InvalidOperationException: NativeOverlapped cannot be reused for multiple operations after .NET 6 to .NET 8 upgrade

This manifests as a fatal exception in the http.sys "accept" loop, which can either:

  1. fault in purely managed code, stopping the message pump (MessagePump.ExecuteAsync)
  2. fault during an IOCP callback, potentially killing the process

The underlying cause appears (hypothesis only) to be a duplicate callback, perhaps similar/related to the "large certificate" scenario (and mitigations) that impacted win8, and in particular this check somehow running both a sync and IOCP version of IOCompleted - this would break our expectations of a single caller into IOCompleted.

IOCompleted is intended to set the outcome of a value-task source; some unguarded SetResult/SetException (especially the final catch-all handler) mean that faults can indeed propagate, but this should not IMO be considered the main issue; rather, this is a symptom (albeit a fatal symptom) of being in an unexpected state, i.e. that we attempted to set the result while it wasn't actively pending.

To investigate and mitigate, this PR:

  1. strengthens the SetResult / SetException aspects of IOCompleted, so that any fault becomes transient rather than fatal
  2. adds logging of any such failure
  3. adds tracking of what completions we're expecting at any point
  4. logs any deviations from those expectations

This should a: help resolve the critical failures, and b: improve our understanding via logging if it does recur.

The use of PreAllocatedOverlapped on a reused AsyncAcceptContext means that we do not have the ability to change the IOCP callback state per call, which makes it impossible to do a reliable assert, but if we can prove the hypothesis, one additional mitigation option might be to remove the use of PreAllocatedOverlapped, using public System.Threading.NativeOverlapped* AllocateNativeOverlapped (System.Threading.IOCompletionCallback callback, object? state, object? pinData) per-accept rather than public System.Threading.NativeOverlapped* AllocateNativeOverlapped (System.Threading.PreAllocatedOverlapped preAllocated), with a different state per usage; this will have some additional overhead, but would allow stronger correctness guarantees. This would be an area to investigate if a: the logging shows that this is indeed what is happening, and b: we are unable to mitigate in other ways (similar to HttpSysListener.SkipIOCPCallbackOnSuccess).

Suggested additional testing here should include "large client certificates" (or just "large headers"?) to see if we can force the same failure mode, although this could be ambitious.

Historic context: prior to net6, there was a separate AsyncAcceptContext and TaskCompletionSource<> per accept; the use of TCS (with TrySet...) means any double-set would have not been observed, and the separate AsyncAcceptContext means that each IOCP callback could not ever get confused and try triggering the wrong state.

1. handle MRTVS cascading fault breaking the accept loop
2. log any expectation failures
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-networking Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions label Mar 5, 2024
@mgravell mgravell changed the title investigate #54251 (more details will be in PR) http.sys accept loop - mitigate against break due to possible conflicting IO callbacks Mar 5, 2024
@mgravell
Copy link
Member Author

mgravell commented Mar 5, 2024

/azp run

@mgravell
Copy link
Member Author

mgravell commented Mar 5, 2024

(test failures are file copy problems)

Copy link

Azure Pipelines successfully started running 3 pipeline(s).

@mgravell
Copy link
Member Author

mgravell commented Mar 7, 2024

@halter73 @BrennanConroy @JamesNK @davidfowl @Tratcher could do with some eyeballs here (production fault)

@amcasey
Copy link
Member

amcasey commented Mar 7, 2024

CI failure looks like #53294 (i.e. known)

@davidfowl
Copy link
Member

This looks good.

…itching

- lock new critical log messages behind app-context switch
@mgravell
Copy link
Member Author

mgravell commented Mar 8, 2024

/backport to release/8.0

Copy link
Contributor

github-actions bot commented Mar 8, 2024

Started backporting to release/8.0: https://github.com/dotnet/aspnetcore/actions/runs/8204527822

Copy link
Contributor

github-actions bot commented Mar 8, 2024

@mgravell backporting to release/8.0 failed, the patch most likely resulted in conflicts:

$ git am --3way --ignore-whitespace --keep-non-patch changes.patch

Applying: investigate #54251 (more details will be in PR)
Using index info to reconstruct a base tree...
M	src/Servers/HttpSys/src/AsyncAcceptContext.cs
M	src/Servers/HttpSys/test/FunctionalTests/Listener/Utilities.cs
Falling back to patching base and 3-way merge...
Auto-merging src/Servers/HttpSys/test/FunctionalTests/Listener/Utilities.cs
Auto-merging src/Servers/HttpSys/src/AsyncAcceptContext.cs
CONFLICT (content): Merge conflict in src/Servers/HttpSys/src/AsyncAcceptContext.cs
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 investigate #54251 (more details will be in PR)
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".
Error: The process '/usr/bin/git' failed with exit code 128

Please backport manually!

Copy link
Contributor

github-actions bot commented Mar 8, 2024

@mgravell an error occurred while backporting to release/8.0, please check the run log for details!

Error: git am failed, most likely due to a merge conflict.

@mgravell
Copy link
Member Author

mgravell commented Mar 8, 2024

/azp run

Copy link

Azure Pipelines successfully started running 3 pipeline(s).

@mgravell mgravell merged commit 032bbb9 into main Mar 11, 2024
26 checks passed
@mgravell mgravell deleted the marc/httpsys-handles branch March 11, 2024 10:56
@dotnet-policy-service dotnet-policy-service bot added this to the 9.0-preview3 milestone Mar 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-networking Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants