Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[TestFailure] Listener_BacklogLimitRefusesConnection_ParallelClients_ClientThrows_Slow fails on ARM #82769

Closed
rzikm opened this issue Feb 28, 2023 · 10 comments · Fixed by #83687
Labels
area-System.Net.Quic disabled-test The test is disabled in source code against the issue test-failure
Milestone

Comments

@rzikm
Copy link
Member

rzikm commented Feb 28, 2023

e.g. https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-heads-main-96bd05a3f91a4661b1/System.Net.Quic.Functional.Tests/3/console.60e17a73.log?helixlogtype=result

Outerloop test. There are about 180 failures in last 2 weeks (as of 28-02-2023)

----- start Tue 28 Feb 2023 12:18:50 PM UTC =============== To repro directly: =====================================================
pushd .
/root/helix/work/correlation/dotnet exec --runtimeconfig System.Net.Quic.Functional.Tests.runtimeconfig.json --depsfile System.Net.Quic.Functional.Tests.deps.json xunit.console.dll System.Net.Quic.Functional.Tests.dll -xml testResults.xml -nologo -nocolor -trait category=OuterLoop -notrait category=IgnoreForCI -notrait category=failing 
popd
===========================================================================================================
/root/helix/work/workitem/e /root/helix/work/workitem/e
  Discovering: System.Net.Quic.Functional.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.Net.Quic.Functional.Tests (found 8 of 122 test cases)
  Starting:    System.Net.Quic.Functional.Tests (parallel test collections = on, max threads = 2)
    System.Net.Quic.Tests.QuicListenerTests.Listener_BacklogLimitRefusesConnection_ParallelClients_ClientThrows_Slow(backlogLimit: 100, connectCount: 99) [FAIL]
      Assert.Equal() Failure
      Expected: 99
      Actual:   98
      Stack Trace:
        /_/src/libraries/System.Net.Quic/tests/FunctionalTests/QuicListenerTests.cs(250,0): at System.Net.Quic.Tests.QuicListenerTests.Listener_BacklogLimitRefusesConnection_ParallelClients_ClientThrows_Core(Int32 backlogLimit, Int32 connectCount)
        /_/src/libraries/System.Net.Quic/tests/FunctionalTests/QuicListenerTests.cs(267,0): at System.Net.Quic.Tests.QuicListenerTests.Listener_BacklogLimitRefusesConnection_ParallelClients_ClientThrows_Core(Int32 backlogLimit, Int32 connectCount)
        --- End of stack trace from previous location ---
      Output:
        Using libmsquic.so version=2.1.7.329711 commit=2db78a2cd72d0c95111d5f4884fbc17307a4238d
    System.Net.Quic.Tests.QuicListenerTests.Listener_BacklogLimitRefusesConnection_ParallelClients_ClientThrows_Slow(backlogLimit: 100, connectCount: 101) [FAIL]
      Assert.Equal() Failure
      Expected: 100
      Actual:   98
      Stack Trace:
        /_/src/libraries/System.Net.Quic/tests/FunctionalTests/QuicListenerTests.cs(244,0): at System.Net.Quic.Tests.QuicListenerTests.Listener_BacklogLimitRefusesConnection_ParallelClients_ClientThrows_Core(Int32 backlogLimit, Int32 connectCount)
        /_/src/libraries/System.Net.Quic/tests/FunctionalTests/QuicListenerTests.cs(267,0): at System.Net.Quic.Tests.QuicListenerTests.Listener_BacklogLimitRefusesConnection_ParallelClients_ClientThrows_Core(Int32 backlogLimit, Int32 connectCount)
        --- End of stack trace from previous location ---
      Output:
        Using libmsquic.so version=2.1.7.329711 commit=2db78a2cd72d0c95111d5f4884fbc17307a4238d
  Finished:    System.Net.Quic.Functional.Tests
Kusto query
cluster('engsrvprod.kusto.windows.net').database('engineeringdata').AzureDevOpsTests
    //| where TestName has 'ConnectWithRevocation_WithCallback'
    | where TestName startswith "System.Net"
        and Outcome == 'Failed' or Outcome == 'PassedOnRerun'
    | extend i = indexof_regex(TestName, @"[^.]+$")
    | extend Method = substring(TestName, i)
    | extend Type = substring(TestName, 0, i - 1)
    | project-away i, TestName
    | where Method has 'Listener_BacklogLimitRefusesConnection_ParallelClients_ClientThrows_Slow'
    //| where Arguments == 'protocol: Ssl3'
    | distinct JobId, WorkItemId, Message, StackTrace, Method, Type, Arguments, Outcome
    | join kind=inner (cluster('engsrvprod.kusto.windows.net').database('engineeringdata').Jobs
        | where Finished > now(-14d)
        | extend TargetBranch = extractjson("$.['System.PullRequest.TargetBranch']", Properties)
        | where (Branch == 'refs/heads/main') //or (Source startswith "pr/" and TargetBranch == 'main')
        | where Type startswith "test/functional/cli/" and not(Properties contains "runtime-staging")
        | summarize arg_max(Finished, Properties, Type, Branch, Source, Started, QueueName, Attempt, Build, TargetBranch) by JobId
    | project-rename JobType = Type) on JobId
    | extend PropertiesJson = parse_json(Properties)
    | extend OS = replace_regex(tostring(PropertiesJson.operatingSystem), @'\((.*)\).*|([^\(].*)', @'\1\2')
    | extend Runtime = iif(PropertiesJson.runtimeFlavor == "mono", "Mono", iif(PropertiesJson.DefinitionName contains "coreclr", "CoreCLR", "")) 
    | extend Architecture = PropertiesJson.architecture
    | extend DefinitionName = PropertiesJson.DefinitionName
    | project-away JobId
    | extend JobId = extractjson("$.['System.JobId']", Properties)
    //| where JobId == '913fd104-6683-4129-8fd6-02f22466d821'
    //| where Architecture == 'arm64'
    //| where QueueName == 'windows.amd64.server2022.open.svc'
    //| where TargetBranch == 'release/6.0'
    | join kind=inner (cluster('engsrvprod.kusto.windows.net').database('engineeringdata').WorkItems
        | project-keep ConsoleUri, WorkItemId) on WorkItemId
    | project-away PropertiesJson
    | project-reorder Started, Type, Method, Arguments, StackTrace, ConsoleUri, OS, Architecture, DefinitionName, Branch, TargetBranch, Attempt, Build, * 
;
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Feb 28, 2023
@ghost
Copy link

ghost commented Feb 28, 2023

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

Issue Details

e.g. https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-heads-main-96bd05a3f91a4661b1/System.Net.Quic.Functional.Tests/3/console.60e17a73.log?helixlogtype=result

----- start Tue 28 Feb 2023 12:18:50 PM UTC =============== To repro directly: =====================================================
pushd .
/root/helix/work/correlation/dotnet exec --runtimeconfig System.Net.Quic.Functional.Tests.runtimeconfig.json --depsfile System.Net.Quic.Functional.Tests.deps.json xunit.console.dll System.Net.Quic.Functional.Tests.dll -xml testResults.xml -nologo -nocolor -trait category=OuterLoop -notrait category=IgnoreForCI -notrait category=failing 
popd
===========================================================================================================
/root/helix/work/workitem/e /root/helix/work/workitem/e
  Discovering: System.Net.Quic.Functional.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.Net.Quic.Functional.Tests (found 8 of 122 test cases)
  Starting:    System.Net.Quic.Functional.Tests (parallel test collections = on, max threads = 2)
    System.Net.Quic.Tests.QuicListenerTests.Listener_BacklogLimitRefusesConnection_ParallelClients_ClientThrows_Slow(backlogLimit: 100, connectCount: 99) [FAIL]
      Assert.Equal() Failure
      Expected: 99
      Actual:   98
      Stack Trace:
        /_/src/libraries/System.Net.Quic/tests/FunctionalTests/QuicListenerTests.cs(250,0): at System.Net.Quic.Tests.QuicListenerTests.Listener_BacklogLimitRefusesConnection_ParallelClients_ClientThrows_Core(Int32 backlogLimit, Int32 connectCount)
        /_/src/libraries/System.Net.Quic/tests/FunctionalTests/QuicListenerTests.cs(267,0): at System.Net.Quic.Tests.QuicListenerTests.Listener_BacklogLimitRefusesConnection_ParallelClients_ClientThrows_Core(Int32 backlogLimit, Int32 connectCount)
        --- End of stack trace from previous location ---
      Output:
        Using libmsquic.so version=2.1.7.329711 commit=2db78a2cd72d0c95111d5f4884fbc17307a4238d
    System.Net.Quic.Tests.QuicListenerTests.Listener_BacklogLimitRefusesConnection_ParallelClients_ClientThrows_Slow(backlogLimit: 100, connectCount: 101) [FAIL]
      Assert.Equal() Failure
      Expected: 100
      Actual:   98
      Stack Trace:
        /_/src/libraries/System.Net.Quic/tests/FunctionalTests/QuicListenerTests.cs(244,0): at System.Net.Quic.Tests.QuicListenerTests.Listener_BacklogLimitRefusesConnection_ParallelClients_ClientThrows_Core(Int32 backlogLimit, Int32 connectCount)
        /_/src/libraries/System.Net.Quic/tests/FunctionalTests/QuicListenerTests.cs(267,0): at System.Net.Quic.Tests.QuicListenerTests.Listener_BacklogLimitRefusesConnection_ParallelClients_ClientThrows_Core(Int32 backlogLimit, Int32 connectCount)
        --- End of stack trace from previous location ---
      Output:
        Using libmsquic.so version=2.1.7.329711 commit=2db78a2cd72d0c95111d5f4884fbc17307a4238d
  Finished:    System.Net.Quic.Functional.Tests

There are about 120 failures in last 2 weeks (as of 28-02-2023)

Author: rzikm
Assignees: -
Labels:

area-System.Net.Quic

Milestone: -

@ManickaP
Copy link
Member

ManickaP commented Feb 28, 2023

AzureDevOpsTests
| where TestName has "Listener_BacklogLimitRefusesConnection_ParallelClients_ClientThrows" and Outcome == "Failed" and RunCompleted >= ago(1day)
| order by BuildDefinitionName, QueueName, TestRunName, RunCompleted
| distinct TestRunName

Gives me this list:

@ManickaP ManickaP self-assigned this Feb 28, 2023
@ManickaP ManickaP removed the untriaged New issue has not been triaged by the area owner label Feb 28, 2023
@ManickaP ManickaP added this to the 8.0.0 milestone Feb 28, 2023
@ManickaP ManickaP added the disabled-test The test is disabled in source code against the issue label Mar 2, 2023
@ManickaP ManickaP modified the milestones: 8.0.0, Future Mar 7, 2023
@ManickaP
Copy link
Member

ManickaP commented Mar 7, 2023

Moving discussion:

@ManickaP:

This is another occurrence of #55979, i.e. msquic is silently dropping connections due to the machine being too slow.
I've temporarily added statistics (6709e8e) and they show multiple dropped connections.
I'm moving this to future as we don't have any solution for this problem. Maybe we could even close this against #55979

@wfurt:

That is pretty sad that we cannot make single test reliable. I'm wondering if we can use the counter to adjust the expectation or decrease the concurrency for ARM. (or tune the drop threshold)

@nibanks:

You can't run a single test without failing? From my previous understanding, QUIC was getting overwhelmed by too much getting queued, and it was (by design) aborting the connection because it could no longer provide the time guarantees to have a reliable connection. But this wasn't caused by just 1 test.

@wfurt:

In this particular test, there is single listener and many parallel attempts to negotiate connection - this test was added recently to verify handling accept queue inside .NET.

@wfurt:

Failed for me on x64 as well while running tests on my VM over the weekend

    System.Net.Quic.Tests.MsQuicPlatformDetectionTests.UnsupportedPlatforms_ThrowsPlatformNotSupportedException [SKIP]
      Condition(s) not met: "IsQuicUnsupported"
      Assert.Equal() Failure
      Expected: 100
      Actual:   94
      Stack Trace:
        /home/furt/github/wfurt-runtime/src/libraries/System.Net.Quic/tests/FunctionalTests/QuicListenerTests.cs(251,0): at System.Net.Quic.Tests.QuicListenerTests.Listener_BacklogLimitRefusesConnection_ParallelClients_ClientThrows_Core(Int32 backlogLimit, Int32 connectCount)
        /home/furt/github/wfurt-runtime/src/libraries/System.Net.Quic/tests/FunctionalTests/QuicListenerTests.cs(268,0): at System.Net.Quic.Tests.QuicListenerTests.Listener_BacklogLimitRefusesConnection_ParallelClients_ClientThrows_Core(Int32 backlogLimit, Int32 connectCount)

@ManickaP
Copy link
Member

ManickaP commented Mar 7, 2023

So I have experimentally increased MaxWorkerQueueDelayUs in 561d61a and it helped. We might consider doing this for the tests only. Any thoughts @wfurt @CarnaViire @nibanks?

@nibanks
Copy link

nibanks commented Mar 7, 2023

I'd really like to understand the scenario more. That commit is essentially saying to QUIC that it's ok to be stuck for 2.5 seconds before sending an ACK to the peer, even though we've told them we will respond within 25 ms. This will result in the peer in sending a retransmit after the 25 ms (likely multiple times) and will result in different test behavior. Best case, the number of packets required is a little different. Worst case, depending on the configuration, the connection is killed for a different reason ("disconnect timeout').

@ManickaP
Copy link
Member

ManickaP commented Mar 7, 2023

We're testing parallel connection attempts to make sure our logic to limit the number of incoming, established connections, that haven't been accepted by the user yet, never outgrows the configured limit. So this test introduces parallel processing, bane of all Quic tests apparently, which in this case is necessary for the test to test what it should. Once this test gets run on "slow" enough system, it starts failing due to connections being discarded by msquic and the observed counts of accepted and rejected connections not corresponding to the expected numbers.

We have encountered this problem multiple times in our tests and it keeps resurfacing. Setting up this parameter is the only way to prevent msquic from discarding connections as far as I understand. I just want to make our tests are more stable. If you know of better way to achieve that, I'll gladly jump on that.

@wfurt
Copy link
Member

wfurt commented Mar 7, 2023

I'm wondering if we can get MsQuic trace and/or packet captures to verify that the logic works as expected.

@nibanks
Copy link

nibanks commented Mar 7, 2023

Can you simply use a smaller limit at your layer on these slower machines? As we've discussed before, MsQuic is operating as expected. It dynamically responds to the capabilities of the machines. A 2-CPU VM obviously isn't going to be able to handle thousands of handshakes per second. What are the exact numbers you are testing, both in terms of connections and CPUs? Based on our perf tests (no .NET included 😄) our perf machines can only handle about 250 connections/per core/per second, best case. I know .NET does more complicated things on top compared to our perf tests, and with VMs (not baremetal) you're likely to have even lower limits.

@ManickaP
Copy link
Member

ManickaP commented Mar 7, 2023

100 parallel connections are made at the beginning of the test and handshake is successfully finished for all of them. We have other combinations of numbers of attempted and accepted connections, but those are the ones that were failing.

EDIT: handshake should be finished for them, but some are discarded before that.

@ManickaP
Copy link
Member

ManickaP commented Mar 7, 2023

And we have tests with lower numbers as well. Now we have to somehow figure out where and when suppress those more demanding, as well as other tests like #83101.

@ManickaP ManickaP removed their assignment Mar 9, 2023
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Mar 20, 2023
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Mar 24, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Apr 23, 2023
@karelz karelz removed this from the Future milestone May 27, 2023
@karelz karelz added this to the 8.0.0 milestone May 27, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Net.Quic disabled-test The test is disabled in source code against the issue test-failure
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants