QUIC: QuicStream.WaitForWriteCompletionAsync sometimes doesn't complete #71927
Comments
Tagging subscribers to this area: @dotnet/ncl
BTW fixing this is a high priority. The bug doesn't impact the HTTP/3 request because it has told the client it is complete, but 5 seconds later the HTTP/3 connection is aborted.
Could you point me to the test code for this? Looks like
Is this a recent regression or has it been there for a while?
I first noticed it 23 days ago. I don't know if it happened before then.
Triage: looks like a reliability issue; let's investigate in 7.0. It should be fairly simple to reproduce locally.
I can reproduce that, and indeed this is because SEND_SHUTDOWN_COMPLETE sometimes does not arrive at the server's stream. While digging through MsQuic logs, I noticed that the client side didn't send an ACK for the server's FIN flag before destroying the stream:

SERVER: sends FIN flag
CLIENT: receives the FIN flag datagram
...marks it for ACK but decides not to do that immediately
...as both of the stream's directions are now shut down, triggers the last events
...after the stream is destroyed, it seems it forgot to send that delayed ACK to the peer... it was scheduled for ~14:38:26.431, but the stream was destroyed at 14:38:26.427
SERVER: did not hear from the client, tries to retransmit the FIN... no response; it isn't even shown in the client connection's logs (I guess it disregards datagrams addressed to already closed streams)

Full MsQuic logs

@nibanks does that look like an MsQuic bug?
@CarnaViire was the stream aborted? I see that
Actually, after looking at the logs, it seems the client just stopped altogether. It's not receiving any other packets. Is this with a breakpoint attached?
The client's stream is not aborted, at least from what I see/understand. This is without any breakpoints: just one run of a client application; it executes one request, receives and handles the response, and then finishes. The following are .NET and gRPC level client logs from another run where the same problem occurred. Everything looks like all the data is received, followed by a graceful shutdown of the stream. After the stream is done, though, the connection stayed alive inside HttpClient and, I guess, just got killed when the app finished. I wonder if that adds any problems? But that would not be the reason we got Ready=0 (whatever that means).
After all, this might be the same issue as #72196. EDIT: Or is that not possible to happen due to the MsQuic threading model?
I don't think it's the same here. Here we consume all the data, because we receive the PEER_SEND_SHUTDOWN event, which would not arrive before the app level consumes everything the transport level received. At least that was my understanding.
We receive all the stream events, including stream shutdown, after which we safely close the stream and believe everything is done with it. Is there a way for the application to know that there's still unfinished work on the transport level related to the stream? Or is it somehow possible to flush the FIN related to the stream on the stream's closing? @nibanks

I wonder, does gRPC dispose HttpClient in the end, or does it stay around until it's destroyed on app exit? @JamesNK
It's up to the app to dispose of the channel/HttpClient. In this case, the gRPC benchmark app isn't disposing HttpClient before the process exits.
If you immediately terminate things in response to receiving a stream event, you're always going to risk not acknowledging your peer, resulting in behavior like this. You can try to do a graceful connection shutdown, and if you use MsQuic we make sure to always ACK everything when we close, but that's not a required behavior (so other implementations won't do it). But this is exactly why HTTP generally only closes on an idle timeout: to make sure everything was exchanged.
We can't do that after a request, because we don't know whether the user is done or not. We can only do so when the user has signaled us that they're done, meaning on dispose. But as James mentions above, the app is free to decide not to do that and just exit as soon as it gets the full response from the server.
I think that's true for the server, but not for the client?
HTTP is spec'ed to close on both sides based on the idle timeout.
The app is free to do anything, but that doesn't mean it's correct. This is not a bug in MsQuic. If you terminate the QUIC connection or the process before we have a chance to ACK, then that is a bug in the upper layer.
In the gRPC benchmark app the HTTP/3 request is completed gracefully. The client writes the message to the request body, and the server writes the response message including a complete notification. At that point shouldn't the client send a notification to the server that the write side of the stream is finished? Or is the problem that there is a race between the process terminating and sending the notification to the server?
AFAICT, once the client gets an indication of the stream/request being complete on its side, it tears everything down immediately (and never sends anything to the server again). This results in the ACK not getting sent and the server never learning that the stream was completed.
We've discussed offline that while there's no ideal solution without ruining perf, flushing ACKs on stream closing (connection-wide) should greatly reduce the race-condition window.
The change as it was in microsoft/msquic#2904 (flush on the stream's Shutdown) didn't fully help, because the ACK might not even be scheduled yet by the time Shutdown executes. However, moving the flush to the stream's Close fixed the problem. I don't see the hang on the server anymore (if I dispose the stream but don't dispose the connection), neither in gRPC nor in my small QUIC-only repro. @nibanks mentioned that there still might be a problem if Close is called from within the Shutdown callback (we don't do that in S.N.Quic; we call Close from a managed thread after the Shutdown callback unblocks the TCS). Next steps: check the perf impact (unless @nibanks believes there has to be a more complicated change; if so, I will hold back until it's available).
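As a toy illustration (hypothetical names, not the MsQuic implementation) of why the flush helps in Close but not in Shutdown: the delayed ACK may only get scheduled after Shutdown has already run, whereas Close is the last thing that happens to the stream, so a flush there always catches it.

```python
# Toy model of flush-on-Shutdown vs flush-on-Close. NOT MsQuic code;
# names and ordering are illustrative.
class Stream:
    def __init__(self):
        self.pending_ack = False
        self.acks_sent = 0

    def flush_acks(self):
        if self.pending_ack:
            self.acks_sent += 1
            self.pending_ack = False

    def shutdown(self, flush=False):
        # Both directions shut down; microsoft/msquic#2904 flushed here.
        if flush:
            self.flush_acks()

    def schedule_delayed_ack(self):
        # Crucially, this may run AFTER shutdown() has already executed.
        self.pending_ack = True

    def close(self, flush=False):
        # Stream object is freed; the flush moved here always runs last.
        if flush:
            self.flush_acks()


# Flush on Shutdown: the ACK gets scheduled too late and is lost.
a = Stream()
a.shutdown(flush=True)
a.schedule_delayed_ack()
a.close()
print(a.acks_sent)  # -> 0

# Flush on Close: Close runs after everything else, so the ACK goes out.
b = Stream()
b.shutdown()
b.schedule_delayed_ack()
b.close(flush=True)
print(b.acks_sent)  # -> 1
```

This mirrors the finding above that the flush has to live in Close, not Shutdown, to reliably cover the late-scheduled ACK.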
Triage: We had a meeting where we decided the scenario is fairly corner-case (Kestrel users won't see anything unless they enable Diagnostics). Moving to 8.0. If we get customer complaints from the wild, we can always service .NET.
Triage: we need to create an issue in the MsQuic repo and link the PR microsoft/msquic#2904 there, and mention that in order for it to work, the code should be in Close, not Shutdown (see the comment above).
Triage: the last comment stands.
Did we create a new issue @CarnaViire, or is this something that was fixed in 2.2? I cannot quite tell, and I'm not sure what the next step is here.
Hello
Created microsoft/msquic#3842
Description
I've noticed Kestrel sometimes throws an error when handling a gRPC request. See dotnet/aspnetcore#42289
tldr; the client sent a gRPC request, the server processed it and sent a response, but aborted the connection because the request is hung on WaitForWriteCompletionAsync, and thought the request was still in-progress and exceeded the timeout to drain the request.
I investigated what is going on, and it appears that most of the time in my scenario
QuicStream.WaitForWriteCompletionAsync
never returns. It appears to work reliably if there is a breakpoint before awaiting the task, but not if there is a breakpoint after awaiting it.

Reproduction Steps
git clone https://github.com/jamesnk/grpc-dotnet
git checkout jamesnk/net7
dotnet run -c Release --project .\perf\benchmarkapps\GrpcAspNetCoreServer\ -- --LogLevel Trace --protocol h3
dotnet run -c Release --project .\perf\benchmarkapps\GrpcClient\ -- -u https://localhost:5000 -c 1 --streams 1 -s unary -p h3 --grpcClientType grpcnetclient --callCount 1
Expected behavior
Client and server gracefully complete request.
Actual behavior
Half the time the QuicStream.WaitForWriteCompletionAsync task doesn't complete. The server aborts the request and the connection because they're unhealthy.
Logs of success and failure are below.
Success logs:
Failure logs:
The request then hangs, waiting on the task that doesn't complete. After 5 seconds Kestrel aborts the connection, which calls QuicConnection.CloseAsync, which unsticks
WaitForWriteCompletionAsync
and the task completes.

Logs after abort:
Regression?
Unknown
Known Workarounds
No response
Configuration
Latest .NET 7 daily build
Other information
No response
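The hang described under "Actual behavior" can be modeled with a small asyncio sketch (a toy stand-in, not the real QuicStream API; the names and the shortened timeout are illustrative): the "write completed" future is only ever completed by closing the connection, so the drain wait always times out first.

```python
# Toy asyncio model of the hang: the write-completion wait never finishes
# on its own, and only aborting the connection unsticks it. NOT the real
# System.Net.Quic API; all names here are hypothetical.
import asyncio


class FakeStream:
    """Models a stream whose SEND_SHUTDOWN_COMPLETE event never arrives."""

    def __init__(self):
        self.write_completed = asyncio.get_running_loop().create_future()

    async def wait_for_write_completion(self):
        # Analogue of QuicStream.WaitForWriteCompletionAsync: hangs until
        # something external completes the future.
        await self.write_completed

    def close_connection(self):
        # Analogue of QuicConnection.CloseAsync: the abort completes any
        # outstanding write-completion waits.
        if not self.write_completed.done():
            self.write_completed.set_result(None)


async def main():
    stream = FakeStream()
    waiter = asyncio.ensure_future(stream.wait_for_write_completion())

    # Kestrel's request-drain timeout (5 s in the logs, shortened here).
    done, _pending = await asyncio.wait({waiter}, timeout=0.05)
    timed_out = not done              # the wait never finished on its own

    stream.close_connection()         # aborting the connection unsticks it
    await waiter
    return timed_out, stream.write_completed.done()


print(asyncio.run(main()))  # -> (True, True)
```

The (True, True) result mirrors the reported behavior: the wait times out on its own, and only the connection abort finally completes it.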