
HTTP/3 connection or stream shutdown is slow #77139

Closed
BrennanConroy opened this issue Oct 18, 2022 · 5 comments · Fixed by #104035
Labels: area-System.Net.Quic, in-pr (there is an active PR which will close this issue when it is merged), tenet-performance (performance related issue)

Comments

@BrennanConroy (Member)

Description

gRPC HTTP/3 benchmark performance has drastically decreased. Stream/connection shutdown appears to take a very long time, which skews the RPS calculation. (And slow shutdown doesn't seem like a good thing regardless of RPS calculations.)

Configuration

Linux citrine perf machine, crank commands below

Regression?

Yes. Something in the commit range below made the HTTP/3 gRPC tests dramatically slower (shutdown takes so long that it stretches the window used for the RPS calculation):
b7fca07...1cb2f42

Data

Changing only the runtime version used by the client shows a difference in RPS.

Before the changes:

~3.8M RPS

crank command
crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/build/ci.profile.yml --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/grpc.benchmarks.yml --scenario grpcaspnetcoreserver-grpcnetclient --profile intel-lin-app --profile intel-load-load --variable scenario=serverstreaming --variable requestSize=0 --variable responseSize=0 --variable streams=70 --variable connections=20 --variable threads=20 --variable protocol=h3 --application.framework net7.0 --load.framework net7.0 --application.options.collectCounters true --load.runtimeVersion 7.0.0-preview.7.22362.12

After the changes:

~300k RPS

crank command
crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/build/ci.profile.yml --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/grpc.benchmarks.yml --scenario grpcaspnetcoreserver-grpcnetclient --profile intel-lin-app --profile intel-load-load --variable scenario=serverstreaming --variable requestSize=0 --variable responseSize=0 --variable streams=70 --variable connections=20 --variable threads=20 --variable protocol=h3 --application.framework net7.0 --load.framework net7.0 --application.options.collectCounters true --load.runtimeVersion 7.0.0-rc.1.22363.10

Analysis

I narrowed the regression down to the runtime commit range b7fca07...1cb2f42, which contains some Quic API changes that may be involved.

@ghost commented Oct 18, 2022

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

@ghost added the untriaged label Oct 18, 2022
@ManickaP (Member)

> It looks like something with stream/connection shutdown is taking forever, which affects RPS calculation.

@BrennanConroy how did you come to this suspicion? Just to help us avoid reinventing the wheel when looking into this.

@ManickaP self-assigned this Oct 18, 2022
@ManickaP (Member)

I ran both crank commands, and the first one (the fast one, from before the changes) is full of errors:
http://asp-citrine-load:5001/jobs/304/output

[12:36:56.133] 8 16: Error message: Grpc.Core.RpcException: Status(StatusCode="Unavailable", Detail="Error reading next message. IOException: An error occurred while sending the request. HttpRequestException: An error occurred while sending the request. ArgumentOutOfRangeException: Specified argument was out of the range of valid values.", DebugException="System.IO.IOException: An error occurred while sending the request.
 ---> System.Net.Http.HttpRequestException: An error occurred while sending the request.
 ---> System.ArgumentOutOfRangeException: Specified argument was out of the range of valid values.
   at System.Net.Http.Http3RequestStream.ReadFrameEnvelopeAsync(CancellationToken cancellationToken)
   at System.Net.Http.Http3RequestStream.ReadNextDataFrameAsync(HttpResponseMessage response, CancellationToken cancellationToken)
   at System.Net.Http.Http3RequestStream.ReadResponseContentAsync(HttpResponseMessage response, Memory`1 buffer, CancellationToken cancellationToken)
   --- End of inner exception stack trace ---
   --- End of inner exception stack trace ---
   at System.Net.Http.Http3RequestStream.HandleReadResponseContentException(Exception ex, CancellationToken cancellationToken)
   at System.Net.Http.Http3RequestStream.ReadResponseContentAsync(HttpResponseMessage response, Memory`1 buffer, CancellationToken cancellationToken)
   at Grpc.Net.Client.StreamExtensions.ReadMessageAsync[TResponse](Stream responseStream, GrpcCall call, Func`2 deserializer, String grpcEncoding, Boolean singleMessage, CancellationToken cancellationToken) in /tmp/benchmarks-agent/benchmarks-server-1/vtymw0ce.wkz/grpc-dotnet/src/Grpc.Net.Client/Internal/StreamExtensions.cs:line 188
   at Grpc.Net.Client.Internal.HttpContentClientStreamReader`2.MoveNextCore(CancellationToken cancellationToken) in /tmp/benchmarks-agent/benchmarks-server-1/vtymw0ce.wkz/grpc-dotnet/src/Grpc.Net.Client/Internal/HttpContentClientStreamReader.cs:line 223")
   at Grpc.Net.Client.Internal.HttpContentClientStreamReader`2.MoveNextCore(CancellationToken cancellationToken) in /tmp/benchmarks-agent/benchmarks-server-1/vtymw0ce.wkz/grpc-dotnet/src/Grpc.Net.Client/Internal/HttpContentClientStreamReader.cs:line 223
   at GrpcClient.Program.ServerStreamingCall(CancellationTokenSource cts, Int32 connectionId, Int32 streamId) in /tmp/benchmarks-agent/benchmarks-server-1/vtymw0ce.wkz/grpc-dotnet/perf/benchmarkapps/GrpcClient/Program.cs:line 680

So I don't think the tests are comparing the same thing.

Have you seen this error as well?

@ManickaP removed their assignment Oct 18, 2022
@BrennanConroy (Member, Author) commented Oct 18, 2022

> how did you come to this suspicion?

The test should take ~15 seconds, but you can see from the output that it takes 5+ minutes before "Stopping job 'load'" is printed, whereas that happens almost immediately in the "fast" run. So something about the close sequence is slower.

> Have you seen this error as well?

Interesting, I had only seen it once or twice in the output, not dozens of times.

@ManickaP self-assigned this Oct 18, 2022
@ManickaP removed the untriaged label Oct 25, 2022
@ManickaP added this to the 8.0.0 milestone Oct 25, 2022
@ManickaP (Member)

So the culprit is QuicStream.Dispose. Before 7.0 RC (and in 6.0), we only kicked off the stream's SHUTDOWN process (or made sure it was already in progress). After the change, we adopted DisposeAsync and added a wait for SHUTDOWN_COMPLETE. This hurts the server-streaming scenario badly: the client keeps reading data with no end in sight until the gRPC call gets cancelled, which calls Dispose on the HTTP/3 response and, in turn, on the QuicStream.
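
A minimal sketch of that behavioral difference, using illustrative types only (this is not the actual System.Net.Quic implementation):

```csharp
using System;
using System.Threading.Tasks;

// Illustrative stand-in for the stream's shutdown bookkeeping.
class IllustrativeQuicStream : IAsyncDisposable
{
    private readonly TaskCompletionSource _shutdownCompleted = new();

    // 6.0 / pre-RC behavior: kick off SHUTDOWN and return immediately.
    public void Dispose() => StartShutdown();

    // Post-change behavior: also wait for SHUTDOWN_COMPLETE, which stalls
    // disposal while the peer is still streaming data toward us.
    public async ValueTask DisposeAsync()
    {
        StartShutdown();
        await _shutdownCompleted.Task;
    }

    private void StartShutdown()
    {
        // Abort reads/writes and signal the peer; in the real stream this
        // drives the underlying MsQuic shutdown machinery.
    }

    // Invoked by the transport's SHUTDOWN_COMPLETE event in the real stream.
    internal void OnShutdownComplete() => _shutdownCompleted.TrySetResult();
}
```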

When I changed the Dispose behavior so it does not wait on SHUTDOWN_COMPLETE, perf shot up to ~2.9M RPS, well past even the 6.0 number of ~1.25M RPS (tested locally).

However, I'm not 100% sure this is the best solution (I'm wary of issues like #71927), so we'll discuss it as a team before committing to anything.
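
For reference, one way a caller can opt out of the graceful wait is to abort both directions before disposing. This is only a sketch against the .NET 7 System.Net.Quic surface (a preview feature at the time), and the error code is just an example:

```csharp
using System.Net.Quic;
using System.Threading.Tasks;

static class StreamShutdown
{
    // Abort both directions so no data remains for the peer to deliver;
    // SHUTDOWN_COMPLETE should then arrive promptly, at the cost of the
    // graceful-close guarantees that #71927 is concerned about.
    public static async Task AbortiveCloseAsync(QuicStream stream)
    {
        stream.Abort(QuicAbortDirection.Both, errorCode: 0x10C); // 0x10C = H3_REQUEST_CANCELLED, as an example
        await stream.DisposeAsync();
    }
}
```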

ManickaP added a commit to ManickaP/runtime that referenced this issue Jan 16, 2023
Fixes dotnet#77139

Includes additional logging to check for leaks and freed GC handles; remove them before publishing.
@karelz modified the milestone: 8.0.0 → 9.0.0, Jul 18, 2023
@ManickaP removed their assignment Feb 13, 2024
@liveans self-assigned this Apr 9, 2024
@dotnet-policy-service bot added the in-pr label Jun 26, 2024
@github-actions bot locked and limited conversation to collaborators Aug 2, 2024