Resolve network deadlock due to reorgs in low sync high latency setup. #19239
Sorry this hasn't received a review yet. The reason is that the fetcher/downloader are complex beasts, and this PR makes large changes to how they work.
Is this only related to Clique PoA or does it affect PoW too?
if cmp := pTd.Cmp(td); cmp <= 0 {
	if cmp < 0 {
		// propagate our better chain back to peers
		go pm.BroadcastBlock(currentBlock, true)
This is a big no-no. It propagates entire blocks to everyone because one peer is out of sync. I don't think we should "think" in place of our peers. We announce in the handshake what our TD is and also do it on every block.
If the block is already known by a peer, BroadcastBlock will skip that peer, so I don't think it's a problem. Besides, another solution is to send the block only to that peer instead of to all peers.
We discussed this change during a PR triage session and have come to the conclusion that this isn't a good fix for the issue. The fetcher / downloader need to change in a fundamental way to handle all announcement scenarios correctly. It would still be very good to have an automated reproducer for this issue.
Thank you for reviewing it. I will try to create an automated reproducer when I have the time.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
To fix #18402 and #16406
ROOT CAUSE
Consensus engines with a high rate of reorgs are not handled properly by geth. Nodes fail to switch between two disconnected forks if the parent of the better chain head has not been imported.
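The parent dependency described here can be illustrated with a toy model. This is a simplified sketch, not geth's import path: `block`, `chain` and `insert` are hypothetical, and total difficulty is a plain integer. It only shows why a better head that arrives alone cannot be adopted until its missing parent has been imported.

```go
package main

import (
	"errors"
	"fmt"
)

// block is a toy block carrying only what is needed to show the parent dependency.
type block struct {
	hash, parent string
	td           int // total difficulty up to and including this block
}

var errUnknownParent = errors.New("unknown parent")

// chain is a toy block store with a current head (hypothetical, not geth's BlockChain).
type chain struct {
	blocks map[string]block
	head   block
}

func newChain(genesis block) *chain {
	return &chain{blocks: map[string]block{genesis.hash: genesis}, head: genesis}
}

// insert stores b only if its parent is already known, and switches the head
// to b when b has a higher total difficulty than the current head.
func (c *chain) insert(b block) error {
	if _, ok := c.blocks[b.parent]; !ok {
		return errUnknownParent
	}
	c.blocks[b.hash] = b
	if b.td > c.head.td {
		c.head = b
	}
	return nil
}

func main() {
	c := newChain(block{hash: "g", td: 1})
	_ = c.insert(block{hash: "a1", parent: "g", td: 2}) // local fork

	b1 := block{hash: "b1", parent: "g", td: 3}
	b2 := block{hash: "b2", parent: "b1", td: 5} // head of the better fork

	// The better head arrives without its parent: the node stays on a1.
	fmt.Println(c.insert(b2), c.head.hash) // prints: unknown parent a1

	// Once the parent is imported, the switch to the better fork succeeds.
	_ = c.insert(b1)
	fmt.Println(c.insert(b2), c.head.hash) // prints: <nil> b2
}
```

In this model the node stays on its weaker fork until something causes the missing ancestor to be fetched.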
REPRODUCE STEPS
The issue happens occasionally and at random; here are some factors that increase the chance of reproducing it:
SOLUTION
RESULT
The fix has been tested under extreme conditions in which the deadlock could easily happen before:
No deadlock.
Increasing the latency can sometimes cause the last peer (of n/2+1) to be dropped for failing to respond to ping/pong messages. But it eventually reconnects and the network continues. No deadlock so far.