Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix test timeout related to min connected peers #1443

Conversation

stana-miric
Copy link
Contributor

The problem with timeout in tests when waiting for cluster start-up is related to not enough min connected peers due to a dialing problem. Before starting the polybft consensus protocol, there is a check if at least 2 peers are connected before proceeding with the initialization. When dialing peers, the dial can fail for several reasons, e.g.

  • the identity protocol is not yet registered on the peer we want to connect to (this is not unusual since in test the servers in the cluster are started at the same time)
    "unable to create new identity client connection, protocols not supported: [/id/0.1]"
  • for some reason, dial timeout occurs. Since we don't set the timeout explicitly when creating the p2p host, a default dial timeout is used which libp2p sets to 15s, and it should be quite enough (this should happen very rarely unless there is some issue with the machine)
    "failed to negotiate security protocol: context deadline exceeded"

If this happens there is a redial mechanism which should handle this and after that peers will be eventually connected. However, during multiple tests executing when analyzing this problem, there is a doubt that a bug in the process of discovering peers and adding them to dialing queue exists. Analyzing the logs from this one case (3 nodes in total), it can be seen that the node connected to 1 peer but could not connect to another peer. Later, dialing that other peer is not called because the dial queue was empty, even though the peer should be added when running peer discovery. It seems that peer was present in the routing table, but it was disconnected, also when dialing fails in regular cases we see the logs for removing the peer info ("Attempted removing missing peer") which will probably enable adding it to the dialing queue and it this problematic scenario these logs are missing. Since the problem should be analyzed more detailed (if possible with higher reproducibility), and there is a task in which the parallel peer dialing should be implemented which should enhance peers connection time, this problematic scenario will be explained in that task and logs attached.
Through this PR the following is done:

  • timeout for cluster startup is increased to 1 min, this can help if dialing timeout occurs in a way that it will be more time available for redialing
  • number of nodes in test that are failing because of this specific issue are increased (for allow list from 3 to 5 and for property tests min nodes is set to 5 as well), this can be helpful if it takes time for discover protocol to add peer to the queue or if the above described problematic case occurs the tolerance of bad nodes count will be greater with more nodes
    Probably we should review the condition for number of min peers condition when waiting before starting consensus protocol (minSyncPeers = 2 in polybft.go) and (compare it to?) minimum peer connections used for keepAliveMinimumPeerConnections (MinimumPeerConnections = 1 in server.go)

Changes include

  • Bugfix (non-breaking change that solves an issue)
  • Hotfix (change that solves an urgent issue, and requires immediate attention)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (change that is not backwards-compatible and/or changes current functionality)

Checklist

  • I have assigned this PR to myself
  • I have added at least 1 reviewer
  • I have added the relevant labels
  • I have updated the official documentation
  • I have added sufficient documentation in code

Testing

  • I have tested this code with the official test suite
  • I have tested this code manually

@Stefan-Ethernal Stefan-Ethernal merged commit 852f44a into develop May 1, 2023
@Stefan-Ethernal Stefan-Ethernal deleted the EVM-615-fix-cluster-startup-related-timeout-in-e-2-e-test branch May 1, 2023 18:15
@github-actions github-actions bot locked and limited conversation to collaborators May 1, 2023
Copy link
Contributor

@vcastellm vcastellm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Late LGTM

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants