
Resolve network deadlock due to reorgs in low sync high latency setup. #19239

Closed · wants to merge 3 commits

Conversation

@Zergity commented Mar 8, 2019

To fix #18402 and #16406

ROOT CAUSE

Consensus engines with a high rate of reorgs are not handled properly by geth. Nodes fail to switch between two disconnected forks if the parent of the better chain head has not been imported.
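
A simplified sketch of the failure pattern described above; the names (Hash, Block, localChain, insertBlock) are hypothetical and are not geth's actual types. It only illustrates how discarding a block whose parent is unknown can leave two diverged forks unable to reconcile:

```go
package sketch

import "errors"

// Hypothetical, simplified types used only for illustration.
type Hash [32]byte

type Block struct {
	Hash       Hash
	ParentHash Hash
}

type localChain struct {
	blocks map[Hash]*Block
}

var errUnknownParent = errors.New("unknown parent")

// insertBlock mirrors the problematic pattern: a block whose parent has not
// been imported yet is dropped instead of triggering a fetch of that parent,
// so a node stuck on a worse fork may never adopt the better, disconnected fork.
func insertBlock(c *localChain, b *Block) error {
	if _, ok := c.blocks[b.ParentHash]; !ok {
		return errUnknownParent // block discarded; deadlock risk across forks
	}
	c.blocks[b.Hash] = b
	return nil
}
```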

REPRODUCE STEPS

The issue happens occasionally and randomly; the following factors increase the chance of reproducing it:

  • Shorter block time (1s)
  • Low connectivity (use --maxpeers=3 for a Clique network of 4/6 nodes)
  • High latency and packet loss: use the Linux tool "tc qdisc netem" to configure the network (delay 0-2s, packet loss rate 0.1-0.9%)

SOLUTION

  • When syncing with another peer (usually the best peer), if the remote chain turns out to be worse than our local chain, we should propagate our better chain back to the network, or at least to that peer.
  • When importing a block with an unknown parent, the node should request the parent block from the same peer instead of discarding the original block (see the sketch after this list).
  • Recommended: a consistent chain comparison rule. (This is consensus specific, so it is not included in this PR.)
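
The following is a minimal sketch of the first two points, assuming hypothetical Peer and Chain interfaces (TD, RequestBlock, SendNewBlock, and so on); it is not the actual PR diff. In the real handler the peer's total difficulty comes from the handshake and block announcements, and the parent request would go through the fetcher/downloader.

```go
package sketch

import "math/big"

// Hypothetical hash type and peer/chain interfaces; not geth's real API.
type Hash [32]byte

type Peer interface {
	TD() *big.Int                 // total difficulty the peer announced
	RequestBlock(hash Hash) error // ask this same peer for a block by hash
	SendNewBlock(hash Hash) error // push one of our blocks to this peer
}

type Chain interface {
	CurrentTD() *big.Int
	CurrentHead() Hash
	HasBlock(hash Hash) bool
}

// onSyncedWithWorsePeer covers the first point: if the remote chain turns out
// to be worse than ours after syncing, push our better head back to that peer
// instead of doing nothing.
func onSyncedWithWorsePeer(c Chain, p Peer) error {
	if p.TD().Cmp(c.CurrentTD()) < 0 {
		return p.SendNewBlock(c.CurrentHead())
	}
	return nil // remote chain is better or equal; nothing to propagate
}

// onUnknownParent covers the second point: instead of discarding a block
// whose parent is missing, request that parent from the same peer.
func onUnknownParent(c Chain, p Peer, parent Hash) error {
	if !c.HasBlock(parent) {
		return p.RequestBlock(parent)
	}
	return nil
}
```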

RESULT

The fix has been tested under extreme conditions in which the deadlock could easily occur before:

  • Clique block time 1s, 4/6 nodes.
  • Network jitter latency 0-1000ms.

No deadlock.

Increasing the latency can sometimes cause the last peer (of n/2+1) to be dropped for failing to respond to ping/pong messages, but it eventually reconnects and the network continues. No deadlock so far.

@holiman (Contributor) commented Aug 22, 2019

Sorry this hasn't received a review yet. The reason behind that is that the fetcher/downloader are complex beasts, and this PR makes large changes to how that works.
A better approach, IMO, would be to start with a testcase that reproduces the problem. That testcase could be made in a separate PR, and then we could more easily reason about what's happening, and experiment with the proper way to solve this.

@ivica7 commented Oct 11, 2019

Is this only related to Clique PoA or does it affect PoW too?

```go
if cmp := pTd.Cmp(td); cmp <= 0 {
	if cmp < 0 {
		// propagate our better chain back to peers
		go pm.BroadcastBlock(currentBlock, true)
```
Review comment on this hunk (Member):

This is a big no-no. It propagates entire blocks to everyone because one peer is out of sync. I don't think we should "think" in place of our peers. We announce in the handshake what our TD is and also do it on every block.

Reply from @Zergity (Author):

If the block is already known by a peer, BroadcastBlock will skip that peer, so I don't think it's a problem. Besides, another solution is to broadcast only to that peer instead of all peers (see the sketch below).
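
A rough sketch of that narrower alternative, assuming a hypothetical per-peer send method; AsyncSendNewBlock below is only a stand-in, not a reference to the actual geth API:

```go
package sketch

import "math/big"

// Hypothetical stand-ins for a block and a single remote peer handle.
type Block struct{}

type Peer interface {
	// AsyncSendNewBlock is assumed to queue a NewBlock message to this one peer.
	AsyncSendNewBlock(block *Block, td *big.Int)
}

// notifyLaggingPeer pushes our current head only to the peer whose announced
// total difficulty (pTd) is lower than our local one (td), instead of
// broadcasting the block to every connected peer.
func notifyLaggingPeer(p Peer, currentBlock *Block, pTd, td *big.Int) {
	if pTd.Cmp(td) < 0 {
		p.AsyncSendNewBlock(currentBlock, td)
	}
}
```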

@fjl (Contributor) commented Oct 29, 2019

We discussed this change during a PR triage session and have come to the conclusion that this isn't a good fix for the issue. The fetcher / downloader need to change in a fundamental way to handle all announcement scenarios correctly.

It would still be very good to have an automated reproducer for this issue.

@Zergity (Author) commented Nov 8, 2019

> We discussed this change during a PR triage session and have come to the conclusion that this isn't a good fix for the issue. The fetcher / downloader need to change in a fundamental way to handle all announcement scenarios correctly.
>
> It would still be very good to have an automated reproducer for this issue.

Thank you for reviewing it. I will try to create an automated reproducer when I have the time.

stale bot commented Dec 25, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@holiman holiman closed this Apr 27, 2022
Successfully merging this pull request may close these issues:

  • PoA network, all the sealers are waiting for each other after 2 months running, possible deadlock?