Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HTTP/3] Stress test regression #56310

Closed
CarnaViire opened this issue Jul 26, 2021 · 12 comments
Closed

[HTTP/3] Stress test regression #56310

CarnaViire opened this issue Jul 26, 2021 · 12 comments
Assignees
Milestone

Comments

@CarnaViire
Copy link
Member

CarnaViire commented Jul 26, 2021

Last success run was on 7/20
https://dev.azure.com/dnceng/public/_build/results?buildId=1248255&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=1451f5f3-0108-5a08-5b92-e984b2a85bbd&l=700

client_1  | [07/20/2021 14:15:10] Total: 8,238 Runtime: 00:31:00
client_1  | 	 0: GET                       Success: 543	Canceled: 42	Fail: 0
client_1  | 	 1: GET Partial               Success: 539	Canceled: 46	Fail: 0
client_1  | 	 2: GET Headers               Success: 549	Canceled: 37	Fail: 0
client_1  | 	 3: GET Parameters            Success: 532	Canceled: 55	Fail: 0
client_1  | 	 4: GET Aborted               Success: 549	Canceled: 37	Fail: 0
client_1  | 	 5: POST                      Success: 549	Canceled: 37	Fail: 0
client_1  | 	 6: POST Multipart Data       Success: 541	Canceled: 45	Fail: 0
client_1  | 	 7: POST Duplex               Success: 544	Canceled: 42	Fail: 0
client_1  | 	 8: POST Duplex Slow          Success: 544	Canceled: 42	Fail: 0
client_1  | 	 9: POST Duplex Dispose       Success: 547	Canceled: 39	Fail: 0
client_1  | 	10: POST ExpectContinue       Success: 0	Canceled: 43	Fail: 0
client_1  | 	11: HEAD                      Success: 527	Canceled: 57	Fail: 0
client_1  | 	12: PUT                       Success: 545	Canceled: 39	Fail: 0
client_1  | 	13: PUT Slow                  Success: 543	Canceled: 41	Fail: 0
client_1  | 	14: GET Slow                  Success: 536	Canceled: 48	Fail: 0
client_1  | 	    TOTAL                     Success: 7,588	Canceled: 650	Fail: 0

Regression seen from 7/21
https://dev.azure.com/dnceng/public/_build/results?buildId=1250670&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=1451f5f3-0108-5a08-5b92-e984b2a85bbd
The log is big so it's not showing in UI, you'll need 'View raw log' button

It goes like this:

  1. First, there are some errors like Stream aborted by peer (4294967295) (means stream was disposed prematurely) or timeout or some other unexpected cancellation. Last statistic I was able to see before all requests will start to fail
2021-07-21T13:59:56.4253843Z client_1  | [07/21/2021 13:59:56] Total: 511,180 Runtime: 00:21:00
2021-07-21T13:59:56.4254938Z client_1  | 	 0: GET                       Success: 33,852	Canceled: 2,482	Fail: 0
2021-07-21T13:59:56.4255872Z client_1  | 	 1: GET Partial               Success: 33,948	Canceled: 2,386	Fail: 0
2021-07-21T13:59:56.4257107Z client_1  | 	 2: GET Headers               Success: 33,835	Canceled: 2,500	Fail: 0
2021-07-21T13:59:56.4258055Z client_1  | 	 3: GET Parameters            Success: 33,925	Canceled: 2,410	Fail: 1
2021-07-21T13:59:56.4260027Z client_1  | 	 4: GET Aborted               Success: 33,816	Canceled: 2,506	Fail: 14
2021-07-21T13:59:56.4260783Z client_1  | 	 5: POST                      Success: 33,840	Canceled: 2,494	Fail: 2
2021-07-21T13:59:56.4261487Z client_1  | 	 6: POST Multipart Data       Success: 33,839	Canceled: 2,496	Fail: 1
2021-07-21T13:59:56.4262369Z client_1  | 	 7: POST Duplex               Success: 33,914	Canceled: 2,422	Fail: 0
2021-07-21T13:59:56.4263120Z client_1  | 	 8: POST Duplex Slow          Success: 33,880	Canceled: 2,454	Fail: 2
2021-07-21T13:59:56.4263939Z client_1  | 	 9: POST Duplex Dispose       Success: 33,776	Canceled: 2,560	Fail: 0
2021-07-21T13:59:56.4264675Z client_1  | 	10: POST ExpectContinue       Success: 0	Canceled: 2,491	Fail: 0
2021-07-21T13:59:56.4265339Z client_1  | 	11: HEAD                      Success: 33,889	Canceled: 2,443	Fail: 2
2021-07-21T13:59:56.4265994Z client_1  | 	12: PUT                       Success: 33,855	Canceled: 2,477	Fail: 2
2021-07-21T13:59:56.4266650Z client_1  | 	13: PUT Slow                  Success: 33,641	Canceled: 2,691	Fail: 1
2021-07-21T13:59:56.4267299Z client_1  | 	14: GET Slow                  Success: 33,980	Canceled: 2,352	Fail: 1
2021-07-21T13:59:56.4267950Z client_1  | 	    TOTAL                     Success: 473,990	Canceled: 37,164	Fail: 26
  1. Then, come errors of aborting because of premature connection close. After that, as mentioned, no new requests would succeed.
  2. Then, come errors Connection has been shutdown by transport. Error Code: CONNECTION_IDLE
  3. Then, come erors Requesting HTTP version 3.0 with version policy RequestVersionExact while unable to establish HTTP/3 connection. and they are showing till the end.

(also note the numbers that are now significantly bigger than in #55810)

@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Jul 26, 2021
@dotnet-issue-labeler
Copy link

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@ghost
Copy link

ghost commented Jul 26, 2021

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

Issue Details

Last success run was on 7/20
https://dev.azure.com/dnceng/public/_build/results?buildId=1248255&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=1451f5f3-0108-5a08-5b92-e984b2a85bbd&l=700

client_1  | [07/20/2021 14:15:10] Total: 8,238 Runtime: 00:31:00
client_1  | 	 0: GET                       Success: 543	Canceled: 42	Fail: 0
client_1  | 	 1: GET Partial               Success: 539	Canceled: 46	Fail: 0
client_1  | 	 2: GET Headers               Success: 549	Canceled: 37	Fail: 0
client_1  | 	 3: GET Parameters            Success: 532	Canceled: 55	Fail: 0
client_1  | 	 4: GET Aborted               Success: 549	Canceled: 37	Fail: 0
client_1  | 	 5: POST                      Success: 549	Canceled: 37	Fail: 0
client_1  | 	 6: POST Multipart Data       Success: 541	Canceled: 45	Fail: 0
client_1  | 	 7: POST Duplex               Success: 544	Canceled: 42	Fail: 0

(strange that numbers are even lower than reported in #55810)

Regression seen from 7/21
https://dev.azure.com/dnceng/public/_build/results?buildId=1250670&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=1451f5f3-0108-5a08-5b92-e984b2a85bbd
The log is big so it's not showing in UI, you'll need 'View raw log' button

It goes like this:

  1. First, there are some errors like Stream aborted by peer (4294967295) (means stream was disposed prematurely) or timeout or some other unexpected cancellation. Last statistic I was able to see before all requests will start to fail
2021-07-21T13:59:56.4253843Z client_1  | [07/21/2021 13:59:56] Total: 511,180 Runtime: 00:21:00
2021-07-21T13:59:56.4254938Z client_1  | 	 0: GET                       Success: 33,852	Canceled: 2,482	Fail: 0
2021-07-21T13:59:56.4255872Z client_1  | 	 1: GET Partial               Success: 33,948	Canceled: 2,386	Fail: 0
2021-07-21T13:59:56.4257107Z client_1  | 	 2: GET Headers               Success: 33,835	Canceled: 2,500	Fail: 0
2021-07-21T13:59:56.4258055Z client_1  | 	 3: GET Parameters            Success: 33,925	Canceled: 2,410	Fail: 1
2021-07-21T13:59:56.4260027Z client_1  | 	 4: GET Aborted               Success: 33,816	Canceled: 2,506	Fail: 14
2021-07-21T13:59:56.4260783Z client_1  | 	 5: POST                      Success: 33,840	Canceled: 2,494	Fail: 2
2021-07-21T13:59:56.4261487Z client_1  | 	 6: POST Multipart Data       Success: 33,839	Canceled: 2,496	Fail: 1
2021-07-21T13:59:56.4262369Z client_1  | 	 7: POST Duplex               Success: 33,914	Canceled: 2,422	Fail: 0
2021-07-21T13:59:56.4263120Z client_1  | 	 8: POST Duplex Slow          Success: 33,880	Canceled: 2,454	Fail: 2
2021-07-21T13:59:56.4263939Z client_1  | 	 9: POST Duplex Dispose       Success: 33,776	Canceled: 2,560	Fail: 0
2021-07-21T13:59:56.4264675Z client_1  | 	10: POST ExpectContinue       Success: 0	Canceled: 2,491	Fail: 0
2021-07-21T13:59:56.4265339Z client_1  | 	11: HEAD                      Success: 33,889	Canceled: 2,443	Fail: 2
2021-07-21T13:59:56.4265994Z client_1  | 	12: PUT                       Success: 33,855	Canceled: 2,477	Fail: 2
2021-07-21T13:59:56.4266650Z client_1  | 	13: PUT Slow                  Success: 33,641	Canceled: 2,691	Fail: 1
2021-07-21T13:59:56.4267299Z client_1  | 	14: GET Slow                  Success: 33,980	Canceled: 2,352	Fail: 1
2021-07-21T13:59:56.4267950Z client_1  | 	    TOTAL                     Success: 473,990	Canceled: 37,164	Fail: 26
  1. Then, come errors of aborting because of premature connection close. After that, as mentioned, no new requests would succeed.
  2. Then, come errors Connection has been shutdown by transport. Error Code: CONNECTION_IDLE
  3. Then, come erors Requesting HTTP version 3.0 with version policy RequestVersionExact while unable to establish HTTP/3 connection. and they are showing till the end.

(also note the numbers that are now bigger than in #55810)

Author: CarnaViire
Assignees: -
Labels:

area-System.Net.Http, untriaged

Milestone: -

@karelz karelz added bug and removed untriaged New issue has not been triaged by the area owner labels Jul 26, 2021
@karelz karelz added this to the 6.0.0 milestone Jul 26, 2021
@CarnaViire CarnaViire self-assigned this Jul 28, 2021
@CarnaViire
Copy link
Member Author

I realized that it's not really a regression, it's more like #55810 got fixed somehow. Before, we had super small numbers of requests and they didn't change since 1st minute of running. Now, we finally have more decent-ish numbers and that surfaced the issues we weren't able to see before.

I wonder if connection getting closed and then not opening again is yet another symptom of #55979

Sadly I'm unable to reproduce the problem (closed connection) locally even though it's the same docker container :( It seems I run out of ports earlier due to #56151, will need to check out after it gets fixed.

@ManickaP
Copy link
Member

Triage: no time to investigate yet, Natalia wasn't able to reproduce locally in the same container. If I don't reproduce it either, we might punt it.

@CarnaViire
Copy link
Member Author

This issue is now hidden by new issue that affects stress #57647 so it is not actionable at the moment.

@ManickaP
Copy link
Member

ManickaP commented Aug 25, 2021

2021-08-25T10:03:38.2115980Z client_1  | HttpStress Run Final Report
2021-08-25T10:03:38.2117020Z client_1  | 
2021-08-25T10:03:38.2117648Z client_1  | [08/25/2021 10:03:38] Total: 269,685 Runtime: 00:30:00
2021-08-25T10:03:38.2118567Z client_1  | 	 0: GET                       Success: 17,822	Canceled: 1,343	Fail: 5
2021-08-25T10:03:38.2119829Z client_1  | 	 1: GET Partial               Success: 17,875	Canceled: 1,293	Fail: 3
2021-08-25T10:03:38.2125441Z client_1  | 	 2: GET Headers               Success: 17,891	Canceled: 1,277	Fail: 4
2021-08-25T10:03:38.2126532Z client_1  | 	 3: GET Parameters            Success: 17,946	Canceled: 1,221	Fail: 6
2021-08-25T10:03:38.2127425Z client_1  | 	 4: GET Aborted               Success: 17,819	Canceled: 1,352	Fail: 2
2021-08-25T10:03:38.2199003Z client_1  | 	 5: POST                      Success: 17,908	Canceled: 1,262	Fail: 3
2021-08-25T10:03:38.2200149Z client_1  | 	 6: POST Multipart Data       Success: 17,875	Canceled: 1,293	Fail: 5
2021-08-25T10:03:38.2201200Z client_1  | 	 7: POST Duplex               Success: 17,849	Canceled: 1,321	Fail: 3
2021-08-25T10:03:38.2202237Z client_1  | 	 8: POST Duplex Slow          Success: 17,815	Canceled: 1,343	Fail: 15
2021-08-25T10:03:38.2203252Z client_1  | 	 9: POST Duplex Dispose       Success: 17,903	Canceled: 1,263	Fail: 7
2021-08-25T10:03:38.2204274Z client_1  | 	10: POST ExpectContinue       Success: 4	Canceled: 1,261	Fail: 11
2021-08-25T10:03:38.2205285Z client_1  | 	11: HEAD                      Success: 17,931	Canceled: 1,219	Fail: 23
2021-08-25T10:03:38.2206313Z client_1  | 	12: PUT                       Success: 17,854	Canceled: 1,312	Fail: 6
2021-08-25T10:03:38.2207315Z client_1  | 	13: PUT Slow                  Success: 17,645	Canceled: 1,503	Fail: 23
2021-08-25T10:03:38.2208349Z client_1  | 	14: GET Slow                  Success: 17,794	Canceled: 1,371	Fail: 4
2021-08-25T10:03:38.2209369Z client_1  | 	    TOTAL                     Success: 249,931	Canceled: 19,634	Fail: 120
2021-08-25T10:03:38.2210196Z client_1  | 
2021-08-25T10:03:38.2470243Z client_1  | Latency(ms) : n=269685, p50=3.73, p75=7.27, p99=64.32, p999=5551.17, max=60008.11

120 errors from 249 931 in 30 mins, looks much better now. I'll keep investigating the specific errors, but at least we now have some reasonable data.

EDIT: comparing to numbers from #55810, we still have 10%-20% throughput of H/1.1 | H/2...

@ManickaP
Copy link
Member

So the latest numbers (from #58442 - Expect 100 Continue excluded) are:

HttpStress Run Final Report
client_1  | 
client_1  | [08/31/2021 16:48:36] Total: 625,060 Runtime: 00:30:00
client_1  | 	 0: GET                       Success: 41,680	Canceled: 2,965	Fail: 0
client_1  | 	 1: GET Partial               Success: 41,699	Canceled: 2,946	Fail: 1
client_1  | 	 2: GET Headers               Success: 41,564	Canceled: 3,081	Fail: 2
client_1  | 	 3: GET Parameters            Success: 41,684	Canceled: 2,960	Fail: 4
client_1  | 	 4: GET Aborted               Success: 41,566	Canceled: 3,081	Fail: 1
client_1  | 	 5: POST                      Success: 41,747	Canceled: 2,899	Fail: 2
client_1  | 	 6: POST Multipart Data       Success: 41,620	Canceled: 3,028	Fail: 0
client_1  | 	 7: POST Duplex               Success: 41,639	Canceled: 3,008	Fail: 1
client_1  | 	 8: POST Duplex Slow          Success: 41,654	Canceled: 2,987	Fail: 7
client_1  | 	 9: POST Duplex Dispose       Success: 41,617	Canceled: 3,030	Fail: 1
client_1  | 	11: HEAD                      Success: 41,677	Canceled: 2,970	Fail: 1
client_1  | 	12: PUT                       Success: 41,673	Canceled: 2,975	Fail: 0
client_1  | 	13: PUT Slow                  Success: 40,938	Canceled: 3,699	Fail: 9
client_1  | 	14: GET Slow                  Success: 41,578	Canceled: 3,065	Fail: 1
client_1  | 	    TOTAL                     Success: 582,336	Canceled: 42,694	Fail: 30

Double the throughtput (582 k vs 249 k) and only 30 errors (down from 120) where I see 2 major types:

  • timeout (System.Threading.Tasks.TaskCanceledException: The request was canceled due to the configured HttpClient.Timeout of 60 seconds elapsing.)
  • connection closed by the peer (System.Net.Quic.QuicConnectionAbortedException: Connection aborted by peer (258).)

I'll continue investigating, but in general it looks that the state is getting better and better 👏

@wfurt
Copy link
Member

wfurt commented Aug 31, 2021

I'm wondering if we can ever achieve reliable stress with unreliable listener.
But it seems like we can still make some progress.

@geoffkizer
Copy link
Contributor

I'm wondering if we can ever achieve reliable stress with unreliable listener.

If we can detect the "connection rejected" case that msquic does, then we can exclude it from test results.

I'm not sure if the "connection aborted" errors above are this or not... @ManickaP?

@ManickaP
Copy link
Member

ManickaP commented Sep 6, 2021

@karelz karelz modified the milestones: 6.0.0, 7.0.0 Sep 13, 2021
@ManickaP
Copy link
Member

ManickaP commented Oct 7, 2021

Report:

EDIT: I still haven't root cause the timeout. Timeout is probably an msquic bug: microsoft/msquic#2063

@ManickaP
Copy link
Member

I was able to get to the 30 minutes clean run locally. It'll take time to get to this state in the official pipeline. I'm closing this, since we don't have any regressions in stress anymore.

@ghost ghost locked as resolved and limited conversation to collaborators Nov 17, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

5 participants