[HTTP/3] Stress test failures for GET Aborted scenario #72619
Tagging subscribers to this area: @dotnet/ncl
Issue Details: Most of the failures (~399) are about the scenario not getting an exception it expected to get from the server:
A small part of the failures (~5) is:
Last Main run: https://dev.azure.com/dnceng/public/_build/results?buildId=1890195&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=1451f5f3-0108-5a08-5b92-e984b2a85bbd&l=1632
"Completed unexpectedly" still happens: Run #20220804.2 from yesterday (Aug 4).
It started happening again between Aug 3 and Aug 4. There was a change in ASP.NET 7.0.0-rc.1.22402.14 --> 7.0.0-rc.1.22403.3 (I don't see any changes in S.N.Quic that might have caused this).
What do you expect to happen when a stream is aborted while […]? Previously ASP.NET Core was calling […]. I can't repro getting "The server returned an invalid or unrecognized response." in gRPC benchmarks. However, none of them abort.
My theory is that […]. For completeness, our server code: runtime/src/libraries/System.Net.Http/tests/StressTests/HttpStress/StressServer.cs Lines 241 to 247 in c99545c
This should result in an aborted stream on the client side. From the user's perspective, calling […]
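The referenced StressServer lines aren't reproduced above, so here is a hedged sketch (not the actual stress-server code; the route, body, and error code 0xAB are made up) of an ASP.NET Core endpoint that aborts an HTTP/3 response mid-way, using the real `IHttpResetFeature.Reset(int errorCode)` API:

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Http.Features;

var app = WebApplication.Create();

// Hypothetical endpoint: send part of the body, then reset the stream.
app.MapGet("/abort", async context =>
{
    await context.Response.WriteAsync("partial body");
    await context.Response.Body.FlushAsync();

    // On HTTP/2 and HTTP/3, Reset sends a stream reset with the given
    // protocol error code instead of finishing the response gracefully.
    var reset = context.Features.Get<IHttpResetFeature>();
    if (reset is not null)
    {
        reset.Reset(errorCode: 0xAB); // error code is illustrative
    }
    else
    {
        context.Abort(); // HTTP/1.1 fallback: abort the whole connection
    }
});

app.Run();
```

On the client, reading such a response should surface as an exception from the response stream rather than a successfully completed body.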
DisposeAsync is always called after the request user code is complete. The only way they could conflict is by someone calling […]
I wrote a test to see if I could duplicate it: dotnet/aspnetcore#43183. It runs fine with 1000 requests (at least on my machine).
Based on discussion with @ManickaP: this could be caused by the server-side QuicStream sending success (incl. the FIN flag) when the server code aborted the stream and had not yet sent the full response. As a result, the response on the wire appears complete and successful even though it is incomplete from the server code's point of view.
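The suspected interleaving can be sketched as follows (the QuicStream members are the real System.Net.Quic ones, but the two code paths and the error code are hypothetical):

```csharp
using System.Net.Quic;
using System.Threading.Tasks;

// Hedged sketch of the suspected race: if a graceful-shutdown path runs
// concurrently with (or right after a swallowed) abort, msquic can still
// send FIN, and the peer sees a short-but-"successful" response.
static async Task SketchRace(QuicStream serverStream)
{
    // Path A: server code decides the response failed mid-way.
    serverStream.Abort(QuicAbortDirection.Write, errorCode: 0xAB);

    // Path B: cleanup code completes writes gracefully. If the abort above
    // was lost (e.g. TrySetException returned false and no RESET_STREAM was
    // issued), this queues a FIN and the client treats the truncated
    // response as complete.
    serverStream.CompleteWrites();

    // DisposeAsync then closes the stream; with no abort recorded, the
    // shutdown looks graceful from the transport's point of view.
    await serverStream.DisposeAsync();
}
```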
Could it happen that TrySetException would for some reason unexpectedly return false here, so we end up never issuing an Abort? A subsequent Dispose would then close the stream gracefully... runtime/src/libraries/System.Net.Quic/src/System/Net/Quic/QuicStream.cs Lines 409 to 412 in 8a4de2e
Note that the abort for the read side currently happens regardless of the TrySetException result, though that might be the result of a copy-paste error, because the code adds the flag twice: runtime/src/libraries/System.Net.Quic/src/System/Net/Quic/QuicStream.cs Lines 401 to 405 in 8a4de2e
So I logged this with some added details in […]
And the PR I mentioned before, dotnet/aspnetcore#43011, moved calling […]. @JamesNK could you look into it? Maybe just not calling […]?
Kestrel could move the abort back into a lock if it's needed. My question is: should QuicStream be making these checks itself? I.e., CompleteWrites no-ops (or throws) if an abort has already happened. It doesn't seem like a good user experience to require all calling code to take locks when QuicStream could do it itself. Is QuicStream intended to be thread-safe?
I believe the answer is no. But in some cases we try to detect misuse in order not to corrupt the internal state (like overlapping ReadAsync/WriteAsync calls).
We could check whether the relevant TaskCompletionSource was completed with an exception and short-circuit the method, but there would still be a race IMO, unless we introduce a lock internally and accept a possible perf hit.
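A minimal sketch of the kind of internal guard being discussed (the class, field, and states are hypothetical, not the real QuicStream implementation):

```csharp
using System.Threading;

// Hypothetical write-side state machine: 0 = open, 1 = aborted, 2 = completed.
// Interlocked makes each transition atomic, so only the first caller wins,
// but as noted above it does not fully close the race against an msquic
// shutdown call that is already in flight.
internal sealed class WriteSideGuard
{
    private int _state;

    public bool TryAbort() =>
        Interlocked.CompareExchange(ref _state, 1, 0) == 0;

    public bool TryCompleteWrites() =>
        // Returns false (no-op) if the write side was already aborted
        // or completed; a graceful FIN is only sent on a true result.
        Interlocked.CompareExchange(ref _state, 2, 0) == 0;
}
```

With something like this, CompleteWrites after Abort would quietly no-op instead of sending FIN, without taking a full lock on the hot path.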
That's what happens internally in msquic; we do not prevent the call, though. But if […]. And I don't think it's even in […]. If I'm reading the code of […]. And the […]. @JamesNK do you want me to file an issue in your repo, or transfer this one?
Log an issue please. Leave this one here to confirm that changing ASP.NET Core resolves the stress failures.
Made a change: dotnet/aspnetcore#43630. Does it have any impact?
@CarnaViire would you be able to check it? |
I confirm that there are no such failures anymore, neither locally nor in CI. |
Most recent run (on #72601) https://dev.azure.com/dnceng/public/_build/results?buildId=1895244&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=1451f5f3-0108-5a08-5b92-e984b2a85bbd&l=1596