
Resolve network deadlock due to reorgs in low sync high latency setup. #19239

Closed · wants to merge 3 commits

Conversation

@Zergity commented Mar 8, 2019

To fix #18402 and #16406

ROOT CAUSE

Consensus engines with a high rate of reorgs are not handled properly by geth. Nodes fail to switch between two disconnected forks if the parent of the better chain head has not been imported.
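
A simplified sketch of the failure pattern described above; the names (Hash, Block, localChain, insertBlock) are hypothetical and are not geth's actual types. It only illustrates how discarding a block whose parent is unknown can leave two diverged forks unable to reconcile:

```go
package sketch

import "errors"

// Hypothetical, simplified types used only for illustration.
type Hash [32]byte

type Block struct {
	Hash       Hash
	ParentHash Hash
}

type localChain struct {
	blocks map[Hash]*Block
}

var errUnknownParent = errors.New("unknown parent")

// insertBlock mirrors the problematic pattern: a block whose parent has not
// been imported yet is dropped instead of triggering a fetch of that parent,
// so a node stuck on a worse fork may never adopt the better, disconnected fork.
func insertBlock(c *localChain, b *Block) error {
	if _, ok := c.blocks[b.ParentHash]; !ok {
		return errUnknownParent // block discarded; deadlock risk across forks
	}
	c.blocks[b.Hash] = b
	return nil
}
```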

REPRODUCE STEPS

The issue happens occasionally and randomly; the following factors increase the chance of reproducing it:

  • Shorter block time (1s)
  • Low connectivity (use --maxpeers=3 for a Clique network of 4/6 nodes)
  • High latency and packet loss: use the Linux tool "tc qdisc netem" to configure the network (delay 0-2s, packet loss rate 0.1-0.9%)

SOLUTION

  • When syncing with another peer (usually the best peer), if the remote chain turns out to be worse than our local chain, we should propagate our better chain back to the network, or at least to that peer.
  • When importing a block with an unknown parent, the node should request the parent block from the same peer instead of discarding the original block (see the sketch after this list).
  • Recommended: a consistent chain comparison rule. (This is consensus specific, so it is not included in this PR.)
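
The following is a minimal sketch of the first two points, assuming hypothetical Peer and Chain interfaces (TD, RequestBlock, SendNewBlock, and so on); it is not the actual PR diff. In the real handler the peer's total difficulty comes from the handshake and block announcements, and the parent request would go through the fetcher/downloader.

```go
package sketch

import "math/big"

// Hypothetical hash type and peer/chain interfaces; not geth's real API.
type Hash [32]byte

type Peer interface {
	TD() *big.Int                 // total difficulty the peer announced
	RequestBlock(hash Hash) error // ask this same peer for a block by hash
	SendNewBlock(hash Hash) error // push one of our blocks to this peer
}

type Chain interface {
	CurrentTD() *big.Int
	CurrentHead() Hash
	HasBlock(hash Hash) bool
}

// onSyncedWithWorsePeer covers the first point: if the remote chain turns out
// to be worse than ours after syncing, push our better head back to that peer
// instead of doing nothing.
func onSyncedWithWorsePeer(c Chain, p Peer) error {
	if p.TD().Cmp(c.CurrentTD()) < 0 {
		return p.SendNewBlock(c.CurrentHead())
	}
	return nil // remote chain is better or equal; nothing to propagate
}

// onUnknownParent covers the second point: instead of discarding a block
// whose parent is missing, request that parent from the same peer.
func onUnknownParent(c Chain, p Peer, parent Hash) error {
	if !c.HasBlock(parent) {
		return p.RequestBlock(parent)
	}
	return nil
}
```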

RESULT

The fix has been tested under extreme conditions in which the deadlock could easily occur before:

  • Clique block time 1s, 4/6 nodes.
  • Network jitter latency 0-1000ms.

No deadlock.

Increasing the latency can sometimes cause the last peer (of n/2+1) to be dropped for failing to respond to ping/pong messages, but it eventually reconnects and the network continues. No deadlock so far.

@holiman (Contributor) commented Aug 22, 2019

Sorry this hasn't received a review yet. The reason behind that is that the fetcher/downloader are complex beasts, and this PR makes large changes to how that works.
A better approach, IMO, would be to start with a testcase that reproduces the problem. That testcase could be made in a separate PR, and then we could more easily reason about what's happening, and experiment with the proper way to solve this.

@ivica7 commented Oct 11, 2019

Is this only related to Clique PoA or does it affect PoW too?

```go
if cmp := pTd.Cmp(td); cmp <= 0 {
	if cmp < 0 {
		// propagate our better chain back to peers
		go pm.BroadcastBlock(currentBlock, true)
```
Review comment on this hunk (Member):

This is a big no-no. It propagates entire blocks to everyone because one peer is out of sync. I don't think we should "think" in place of our peers. We announce in the handshake what our TD is and also do it on every block.

Reply from @Zergity (Author):

If the block is already known by a peer, BroadcastBlock will skip that peer, so I don't think it's a problem. Besides, another solution is to broadcast only to that peer instead of all peers (see the sketch below).
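
A rough sketch of that narrower alternative, assuming a hypothetical per-peer send method; AsyncSendNewBlock below is only a stand-in, not a reference to the actual geth API:

```go
package sketch

import "math/big"

// Hypothetical stand-ins for a block and a single remote peer handle.
type Block struct{}

type Peer interface {
	// AsyncSendNewBlock is assumed to queue a NewBlock message to this one peer.
	AsyncSendNewBlock(block *Block, td *big.Int)
}

// notifyLaggingPeer pushes our current head only to the peer whose announced
// total difficulty (pTd) is lower than our local one (td), instead of
// broadcasting the block to every connected peer.
func notifyLaggingPeer(p Peer, currentBlock *Block, pTd, td *big.Int) {
	if pTd.Cmp(td) < 0 {
		p.AsyncSendNewBlock(currentBlock, td)
	}
}
```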

@fjl (Contributor) commented Oct 29, 2019

We discussed this change during a PR triage session and have come to the conclusion that this isn't a good fix for the issue. The fetcher / downloader need to change in a fundamental way to handle all announcement scenarios correctly.

It would still be very good to have an automated reproducer for this issue.

@Zergity (Author) commented Nov 8, 2019

> We discussed this change during a PR triage session and have come to the conclusion that this isn't a good fix for the issue. The fetcher / downloader need to change in a fundamental way to handle all announcement scenarios correctly.
>
> It would still be very good to have an automated reproducer for this issue.

Thank you for reviewing it. I will try to create an automated reproducer when I have the time.

stale bot commented Dec 25, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@holiman holiman closed this Apr 27, 2022
Successfully merging this pull request may close these issues:

  • PoA network, all the sealers are waiting for each other after 2 months running, possible deadlock?