
[HTTP/3] Stress test failures for GET Aborted scenario #72619

Closed
CarnaViire opened this issue Jul 21, 2022 · 18 comments
Labels
area-System.Net.Http bug tenet-reliability Reliability/stability related issue (stress, load problems, etc.) tracking-external-issue The issue is caused by external problem (e.g. OS) - nothing we can do to fix it directly

@CarnaViire
Member

Most of the failures (~399) are about the scenario not getting the exception it expected from the server:

client_1  | System.Exception: Completed unexpectedly
client_1  |    at HttpStress.ClientOperations.<>c.<<get_Operations>b__1_4>d.MoveNext() in /app/ClientOperations.cs:line 313
client_1  | --- End of stack trace from previous location ---
client_1  |    at HttpStress.StressClient.<>c__DisplayClass17_0.<<StartCore>g__RunWorker|0>d.MoveNext() in /app/StressClient.cs:line 224

A small number of failures (~5) are:

client_1  | System.Net.Http.HttpRequestException: The server returned an invalid or unrecognized response.
client_1  |    at System.Net.Http.Http3RequestStream.ReadResponseAsync(CancellationToken cancellationToken) in /_/src/libraries/System.Net.Http/src/System/Net/Http/SocketsHttpHandler/Http3RequestStream.cs:line 337
client_1  |    at System.Net.Http.Http3RequestStream.SendAsync(CancellationToken cancellationToken) in /_/src/libraries/System.Net.Http/src/System/Net/Http/SocketsHttpHandler/Http3RequestStream.cs:line 304
client_1  |    at System.Net.Http.Http3RequestStream.SendAsync(CancellationToken cancellationToken) in /_/src/libraries/System.Net.Http/src/System/Net/Http/SocketsHttpHandler/Http3RequestStream.cs:line 304
client_1  |    at System.Net.Http.Http3Connection.SendAsync(HttpRequestMessage request, Int64 queueStartingTimestamp, CancellationToken cancellationToken) in /_/src/libraries/System.Net.Http/src/System/Net/Http/SocketsHttpHandler/Http3Connection.cs:line 245
client_1  |    at System.Net.Http.HttpConnectionPool.TrySendUsingHttp3Async(HttpRequestMessage request, CancellationToken cancellationToken) in /_/src/libraries/System.Net.Http/src/System/Net/Http/SocketsHttpHandler/HttpConnectionPool.cs:line 974
client_1  |    at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken) in /_/src/libraries/System.Net.Http/src/System/Net/Http/SocketsHttpHandler/HttpConnectionPool.cs:line 1116
client_1  |    at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken) in /_/src/libraries/System.Net.Http/src/System/Net/Http/SocketsHttpHandler/RedirectHandler.cs:line 85
client_1  |    at System.Net.Http.HttpClient.<SendAsync>g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken) in /_/src/libraries/System.Net.Http/src/System/Net/Http/HttpClient.cs:line 558
client_1  |    at HttpStress.RequestContext.SendAsync(HttpRequestMessage request, HttpCompletionOption httpCompletion, Nullable`1 token) in /app/ClientOperations.cs:line 108
client_1  |    at HttpStress.ClientOperations.<>c.<<get_Operations>b__1_4>d.MoveNext() in /app/ClientOperations.cs:line 313
client_1  | --- End of stack trace from previous location ---
client_1  |    at HttpStress.StressClient.<>c__DisplayClass17_0.<<StartCore>g__RunWorker|0>d.MoveNext() in /app/StressClient.cs:line 224

Last Main run: https://dev.azure.com/dnceng/public/_build/results?buildId=1890195&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=1451f5f3-0108-5a08-5b92-e984b2a85bbd&l=1632
Most recent run (on #72601): https://dev.azure.com/dnceng/public/_build/results?buildId=1895244&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=1451f5f3-0108-5a08-5b92-e984b2a85bbd&l=1596

@ghost

ghost commented Jul 21, 2022

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.


@ghost ghost added the untriaged (New issue has not been triaged by the area owner) label Jul 21, 2022
@ManickaP ManickaP added this to the 7.0.0 milestone Jul 25, 2022
@ghost ghost removed the untriaged (New issue has not been triaged by the area owner) label Jul 25, 2022
@ManickaP
Member

The first one is fixed with: #72734
The second one is: #72739
Closing this issue.

@CarnaViire
Member Author

"Completed unexpectedly" still happens: Run #20220804.2 from yesterday (04.08)

@CarnaViire CarnaViire reopened this Aug 5, 2022
@ManickaP
Member

ManickaP commented Aug 8, 2022

It started happening again between Aug 3 and Aug 4. There was a change in ASP.NET Core 7.0.0-rc.1.22402.14 --> 7.0.0-rc.1.22403.3 (I don't see any changes in S.N.Quic that might have caused this).
I see that in dotnet/aspnetcore#43011, Abort and DisposeAsync on the server-side stream were moved outside of _shutdownLock. @JamesNK is it possible now that DisposeAsync races with the Abort?

@JamesNK
Member

JamesNK commented Aug 9, 2022

What do you expect to happen when a stream is aborted while QuicStream.DisposeAsync is being called? Is that not allowed?

Previously ASP.NET Core was calling QuicStream.Dispose. That method just calls QuicStream.DisposeAsync but with sync-over-async, which we obviously don't want. And a lock can't contain an await of an async method.
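
For context, a minimal self-contained sketch of that constraint (placeholder names, not the actual Kestrel code): the C# lock statement rejects await in its body (compiler error CS1996), so an async disposal path typically guards itself with a SemaphoreSlim instead.

using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Illustrative only; _shutdownLock and _stream are stand-ins, not Kestrel's fields.
class AsyncDisposeUnderLockDemo
{
    private readonly SemaphoreSlim _shutdownLock = new(1, 1); // async-compatible replacement for lock
    private readonly Stream _stream = new MemoryStream();     // stand-in for the QUIC stream

    public async ValueTask DisposeStreamAsync()
    {
        // lock (_shutdownLock) { await _stream.DisposeAsync(); }  // would not compile (CS1996)
        await _shutdownLock.WaitAsync();
        try
        {
            await _stream.DisposeAsync();
        }
        finally
        {
            _shutdownLock.Release();
        }
    }

    static async Task Main() => await new AsyncDisposeUnderLockDemo().DisposeStreamAsync();
}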

I can't repro getting "The server returned an invalid or unrecognized response." in gRPC benchmarks. However, none of them abort.

@ManickaP
Member

ManickaP commented Aug 9, 2022

My theory is that HttpContext.Abort finishes before QuicStream.Abort is actually called (i.e. there's a disconnect between those two operations) and it races with QuicStream.DisposeAsync. But please correct me if I'm wrong; there are several layers of code in between and I might be missing something.

For completeness, our server code:

endpoints.MapGet("/abort", async context =>
{
// Server writes some content, then aborts the connection
string content = CreateResponseContent(context);
await context.Response.WriteAsync(content.Substring(0, content.Length / 2));
context.Abort();
});

Which should result in an aborted stream on the client side.
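
For reference, a minimal standalone sketch of what the client side is expected to observe (a hypothetical snippet, not the actual HttpStress code): the abort should surface as an exception while sending the request or reading the body, rather than as a successfully completed response; the "Completed unexpectedly" failure above is exactly the case where no exception arrives.

using System;
using System.IO;
using System.Net;
using System.Net.Http;

using var client = new HttpClient();
using var request = new HttpRequestMessage(HttpMethod.Get, "https://localhost:5001/abort")
{
    Version = HttpVersion.Version30,
    VersionPolicy = HttpVersionPolicy.RequestVersionExact
};

try
{
    using var response = await client.SendAsync(request, HttpCompletionOption.ResponseHeadersRead);
    await response.Content.ReadAsStringAsync();
    throw new Exception("Completed unexpectedly"); // mirrors the stress-test failure above
}
catch (Exception ex) when (ex is HttpRequestException or IOException)
{
    // Expected: the server aborted the stream before the full response arrived.
}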

From the user's perspective, calling Abort on HttpContext should not return before Abort is actually called on QuicStream.

@JamesNK
Member

JamesNK commented Aug 10, 2022

HttpContext.Abort calls QuicStream.Abort inline (i.e. it doesn't dispatch to a different thread):

[screenshot: call path from HttpContext.Abort down to QuicStream.Abort]

DisposeAsync is always called after the request user code is complete. The only way they could conflict is by someone calling HttpContext.Abort on a different thread while the request is being cleaned up, and that's completely unsupported.

@JamesNK
Member

JamesNK commented Aug 10, 2022

I wrote a test to see if I could duplicate it: dotnet/aspnetcore#43183. It runs fine with 1000 requests (at least on my machine).

@karelz
Member

karelz commented Aug 22, 2022

Based on discussion with @ManickaP: this could be caused by the server-side QuicStream signaling success (incl. the FIN flag) when the server code aborted the stream and had not sent the full response yet. As a result, the response on the wire appears complete and successful even though it is incomplete from the server code's point of view.

@karelz karelz added the tenet-reliability (Reliability/stability related issue (stress, load problems, etc.)) label Aug 22, 2022
@CarnaViire
Member Author

Could it happen that TrySetException for some reason unexpectedly returns false here, so we end up never issuing an Abort? A subsequent Dispose would then close the stream gracefully...

if (_sendTcs.TrySetException(ThrowHelper.GetOperationAbortedException(SR.net_quic_writing_aborted), final: true))
{
flags |= QUIC_STREAM_SHUTDOWN_FLAGS.ABORT_SEND;
}

Note that the abort for the read side currently happens regardless of the TrySetException result, though that might be a copy-paste error, because the code adds the flag twice:

flags |= QUIC_STREAM_SHUTDOWN_FLAGS.ABORT_RECEIVE;
if (_receiveTcs.TrySetException(ThrowHelper.GetOperationAbortedException(SR.net_quic_reading_aborted), final: true))
{
flags |= QUIC_STREAM_SHUTDOWN_FLAGS.ABORT_RECEIVE;
}
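
For comparison, a presumably intended shape (an assumption about the intent, not the actual code) would mirror the write side above and only add the flag when TrySetException succeeds:

if (_receiveTcs.TrySetException(ThrowHelper.GetOperationAbortedException(SR.net_quic_reading_aborted), final: true))
{
    flags |= QUIC_STREAM_SHUTDOWN_FLAGS.ABORT_RECEIVE;
}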

@ManickaP
Member

So I logged this with some added details in QuicStream, including before/after calling msquic. DisposeAsync is not called in parallel, but CompleteWrites seems to be:

2022-08-25 17:26:15.729 +02:00 [DBG] 15:26:15.7291100[Info] thisOrContextObject: MsQuicContextSafeHandle#35032069, memberName: .ctor, message: [strm][0x7F88B802D5D0] MsQuicSafeHandle created
2022-08-25 17:26:15.729 +02:00 [DBG] 15:26:15.7291267[Info] thisOrContextObject: MsQuicContextSafeHandle#35032069, memberName: .ctor, message: [strm][0x7F88B802D5D0] [conn][0x7F88A8000B70] ref count incremented
2022-08-25 17:26:15.729 +02:00 [DBG] 15:26:15.7291376[Info] thisOrContextObject: QuicStream#44760329, memberName: .ctor, message: [strm][0x7F88B802D5D0] Stream 15120 create.
2022-08-25 17:26:15.729 +02:00 [DBG] 15:26:15.7291641[Info] thisOrContextObject: QuicStream#44760329, memberName: ReadAsync, message: [strm][0x7F88B802D5D0] Stream reading into memory of '4096' bytes.
2022-08-25 17:26:15.729 +02:00 [DBG] 15:26:15.7291893[Info] thisOrContextObject: QuicStream#44760329, memberName: NativeCallback, message: [strm][0x7F88B802D5D0] Received event RECEIVE
2022-08-25 17:26:15.729 +02:00 [INF] Request starting HTTP/3 GET https://localhost:5001/abort - -
2022-08-25 17:26:15.729 +02:00 [DBG] 15:26:15.7292110[Info] thisOrContextObject: QuicStream#44760329, memberName: NativeCallback, message: [strm][0x7F88B802D5D0] Received event PEER_SEND_SHUTDOWN
2022-08-25 17:26:15.729 +02:00 [DBG] 1 candidate(s) found for the request path '/abort'
2022-08-25 17:26:15.729 +02:00 [DBG] Stream id "0HMK6LOIDHFIB:00003B10" type "Bidirectional" accepted.
2022-08-25 17:26:15.729 +02:00 [DBG] Request matched endpoint 'HTTP: GET /abort'
2022-08-25 17:26:15.729 +02:00 [INF] Executing endpoint 'HTTP: GET /abort'
2022-08-25 17:26:15.729 +02:00 [INF] Request starting HTTP/3 GET https://localhost:5001/abort - -
2022-08-25 17:26:15.729 +02:00 [DBG] 15:26:15.7294208[Info] thisOrContextObject: QuicStream#44760329, memberName: WriteAsync, message: [strm][0x7F88B802D5D0] Stream writing memory of '118' bytes while not completing writes.
2022-08-25 17:26:15.729 +02:00 [DBG] Trace id "0HMK6LOIDHFIB:00003B10": HTTP/3 stream error "H3_INTERNAL_ERROR". An abort is being sent to the stream.
Microsoft.AspNetCore.Connections.ConnectionAbortedException: The connection was aborted by the application.
2022-08-25 17:26:15.729 +02:00 [DBG] 15:26:15.7294606[Info] thisOrContextObject: QuicStream#44760329, memberName: NativeCallback, message: [strm][0x7F88B802D5D0] Received event SEND_COMPLETE
2022-08-25 17:26:15.729 +02:00 [DBG] Stream id "0HMK6LOIDHFIB:00003B10" aborted by application with error code 258 because: "The connection was aborted by the application.".

// QuicStream.Abort started
2022-08-25 17:26:15.729 +02:00 [DBG] 15:26:15.7295309[Info] thisOrContextObject: QuicStream#44760329, memberName: Abort, message: [strm][0x7F88B802D5D0] Abort(Read, 258)
2022-08-25 17:26:15.729 +02:00 [DBG] 15:26:15.7295491[Info] thisOrContextObject: QuicStream#44760329, memberName: Abort, message: [strm][0x7F88B802D5D0] Abort(Write, 258)
2022-08-25 17:26:15.729 +02:00 [DBG] Stream id "0HMK6LOIDHFIB:00003B10" shutting down writes because: "The connection was aborted by the application.".

// QuicStream.CompleteWrites started
2022-08-25 17:26:15.729 +02:00 [DBG] 15:26:15.7298211[Info] thisOrContextObject: QuicStream#44760329, memberName: CompleteWrites, message: [strm][0x7F88B802D5D0] CompleteWrites(15120)

// msquic graceful shutdown called and returned
2022-08-25 17:26:15.729 +02:00 [DBG] 15:26:15.7298452[Info] thisOrContextObject: QuicStream#44760329, memberName: CompleteWrites, message: [strm][0x7F88B802D5D0] CompleteWrites() call
2022-08-25 17:26:15.729 +02:00 [DBG] 15:26:15.7298700[Info] thisOrContextObject: QuicStream#44760329, memberName: CompleteWrites, message: [strm][0x7F88B802D5D0] CompleteWrites() called

// msquic abort called and returned
2022-08-25 17:26:15.729 +02:00 [DBG] 15:26:15.7297337[Info] thisOrContextObject: QuicStream#44760329, memberName: Abort, message: [strm][0x7F88B802D5D0] Abort(ABORT_SEND, 258) call
2022-08-25 17:26:15.729 +02:00 [DBG] 15:26:15.7299125[Info] thisOrContextObject: QuicStream#44760329, memberName: Abort, message: [strm][0x7F88B802D5D0] Abort(ABORT_SEND, 258) called

2022-08-25 17:26:15.729 +02:00 [DBG] 15:26:15.7299668[Info] thisOrContextObject: QuicStream#44760329, memberName: NativeCallback, message: [strm][0x7F88B802D5D0] Received event SEND_SHUTDOWN_COMPLETE
2022-08-25 17:26:15.729 +02:00 [INF] Request finished HTTP/3 GET https://localhost:5001/abort - - - 200 - - 0.6862ms
2022-08-25 17:26:15.730 +02:00 [DBG] 15:26:15.7299979[Info] thisOrContextObject: QuicStream#44760329, memberName: NativeCallback, message: [strm][0x7F88B802D5D0] Received event SHUTDOWN_COMPLETE
2022-08-25 17:26:15.730 +02:00 [DBG] 15:26:15.7300276[Info] thisOrContextObject: MsQuicContextSafeHandle#35032069, memberName: ReleaseHandle, message: [strm][0x7F88B802D5D0] MsQuicSafeHandle released
2022-08-25 17:26:15.730 +02:00 [DBG] 15:26:15.7300503[Info] thisOrContextObject: MsQuicContextSafeHandle#35032069, memberName: ReleaseHandle, message: [strm][0x7F88B802D5D0] [conn][0x7F88A8000B70] ref count decremented

The CompleteWrites call seems to be coming from:
https://github.com/dotnet/aspnetcore/blob/51ccce6d743dddf995b3661f07b0e1990b68dc7d/src/Servers/Kestrel/Transport.Quic/src/Internal/QuicStreamContext.cs#L532-L538

And the PR I mentioned before, dotnet/aspnetcore#43011, moved the Abort call outside of the lock, which enabled this race.

@JamesNK could you look into it? Maybe simply not calling CompleteWrites when you have an abort reason and are going to abort would suffice?

@JamesNK
Member

JamesNK commented Aug 25, 2022

Kestrel could move the abort back into a lock if it's needed.

My question is: should QuicStream be making these checks? I.e., should CompleteWrites no-op if an abort has happened, or throw an error? It doesn't seem like a good user experience to require all calling code to take locks when QuicStream could do it itself. Is QuicStream intended to be thread-safe?

@rzikm
Member

rzikm commented Aug 26, 2022

Is QuicStream intended to be thread-safe?

I believe the answer is no. But in some cases we try to detect it in order not to corrupt the internal state (like overlapping ReadAsync/WriteAsync calls).

My question is, should QuicStream be making these checks? i.e. CompleteWrites no-ops if an abort has happened.

We could check if the relevant TaskCompletionSource was completed with an exception and short-circuit the method, but there would still be a race IMO, unless we introduce a lock internally and impose a possible perf hit.
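
As a hedged, self-contained illustration of that check-then-act race (using a plain TaskCompletionSource, not QuicStream internals): only the atomic TrySet* call decides which outcome wins, so a check made beforehand says nothing about the final state.

using System;
using System.Threading.Tasks;

// Two racing completions of one TaskCompletionSource: exactly one TrySet* wins and the
// loser observes false. Short-circuiting on a pre-check would still leave this window open.
var tcs = new TaskCompletionSource(TaskCreationOptions.RunContinuationsAsynchronously);

Task<bool> abort = Task.Run(() => tcs.TrySetException(new OperationCanceledException("aborted")));
Task<bool> complete = Task.Run(() => tcs.TrySetResult());

Console.WriteLine($"abort won: {await abort}, graceful completion won: {await complete}");
try { await tcs.Task; } catch (OperationCanceledException) { /* observe the aborted outcome */ }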

@ManickaP
Member

i.e. CompleteWrites no-ops if an abort has happened.

That's what happens internally in msquic; we do not prevent the call, though. If CompleteWrites arrives after Abort, it won't do anything.

And I don't think it's even within QuicStream's ability to prevent this race. If there's a possibility of CompleteWrites or WriteAsync(completeWrites: true) running in parallel with Abort, no amount of locking in QuicStream will make 100% sure that Abort gets called first.

If I'm reading the code of AbortCore right:
https://github.com/dotnet/aspnetcore/blob/51ccce6d743dddf995b3661f07b0e1990b68dc7d/src/Servers/Kestrel/Core/src/Internal/Http3/Http3Stream.cs#L145-L184
It first completes the PipeWriter before calling QuicStream.Abort:
https://github.com/dotnet/aspnetcore/blob/51ccce6d743dddf995b3661f07b0e1990b68dc7d/src/Servers/Kestrel/Core/src/Internal/Http3/Http3Stream.cs#L170-L172

And the PipeWriter completion will asynchronously attempt to "complete the writes" gracefully.
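
To illustrate that last point with a hedged, standalone System.IO.Pipelines sketch (not Kestrel code): completing the PipeWriter only signals completion; whatever the reading side does in reaction runs on its own schedule, so it can interleave with code that runs right after Complete() returns, such as the subsequent QuicStream.Abort call.

using System;
using System.IO.Pipelines;
using System.Threading.Tasks;

var pipe = new Pipe();

// Reader side: stands in for the transport reacting to output completion, roughly where
// Kestrel would call QuicStream.CompleteWrites() for a graceful shutdown.
Task reader = Task.Run(async () =>
{
    while (true)
    {
        ReadResult result = await pipe.Reader.ReadAsync();
        pipe.Reader.AdvanceTo(result.Buffer.End);
        if (result.IsCompleted)
        {
            Console.WriteLine("reader observed completion -> graceful CompleteWrites()");
            break;
        }
    }
    await pipe.Reader.CompleteAsync();
});

pipe.Writer.Complete();                      // corresponds to completing the response PipeWriter
Console.WriteLine("Abort() would run here"); // may execute before or after the reader reacts
await reader;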

@JamesNK do you want me to file an issue in your repo? Transfer this one?

@ManickaP ManickaP added the tracking-external-issue (The issue is caused by external problem (e.g. OS) - nothing we can do to fix it directly) label Aug 26, 2022
@JamesNK
Member

JamesNK commented Aug 26, 2022

Please log an issue. Let's leave this one open to confirm that changing ASP.NET Core resolves the stress failures.

@JamesNK
Member

JamesNK commented Sep 6, 2022

Made a change: dotnet/aspnetcore#43630

Does it have any impact?

@karelz
Member

karelz commented Sep 6, 2022

@CarnaViire would you be able to check it?

@karelz karelz assigned CarnaViire and unassigned ManickaP Sep 6, 2022
@CarnaViire
Member Author

I confirm that there are no such failures anymore, neither locally nor in CI.
HTTP/3 stress is now green in CI on both main and release/7.0 🥳
Thanks @JamesNK!

@ghost ghost locked as resolved and limited conversation to collaborators Oct 8, 2022