Evict inactive peers from SyncingEngine
#13829
If both halves of the block announce notification stream have been inactive for 2 minutes, report the peer and disconnect it, allowing `SyncingEngine` to free up a slot for some other peer that is hopefully more active. This is needed because the node may falsely believe it has open connections to peers: the inbound substream can be closed without any notification, and a closed outbound substream is noticed only when the node attempts to write to it, which may never happen if the node has nothing to send.
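A minimal sketch of the eviction rule described above, not the actual `SyncingEngine` code. The `ActivityTracker` type, the `u64` peer identifier, and the method names are hypothetical and exist only for illustration. A single "last activity" timestamp per peer is assumed to be updated on traffic in either direction, so a peer is evicted only when both halves of the substream have been idle for the whole window.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Inactivity window; the PR description uses 2 minutes.
const INACTIVITY_EVICT_THRESHOLD: Duration = Duration::from_secs(2 * 60);

/// Hypothetical peer identifier used only in this sketch.
type PeerId = u64;

/// Tracks when each peer last showed activity on either half of the substream.
#[derive(Default)]
struct ActivityTracker {
    last_activity: HashMap<PeerId, Instant>,
}

impl ActivityTracker {
    /// Record that a notification was sent to or received from `peer` at `now`.
    fn note_activity(&mut self, peer: PeerId, now: Instant) {
        self.last_activity.insert(peer, now);
    }

    /// Remove and return (sorted by id) every peer idle for at least the threshold.
    fn evict_inactive(&mut self, now: Instant) -> Vec<PeerId> {
        let mut evicted: Vec<PeerId> = self
            .last_activity
            .iter()
            .filter(|(_, last)| {
                now.saturating_duration_since(**last) >= INACTIVITY_EVICT_THRESHOLD
            })
            .map(|(peer, _)| *peer)
            .collect();
        evicted.sort_unstable();
        for peer in &evicted {
            self.last_activity.remove(peer);
        }
        evicted
    }
}
```

In a real node the caller would presumably run `evict_inactive` on a timer and then report and disconnect each returned peer; that part is left out here.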
We need to change the logic a little bit. When the chain is stalled, we should not start evicting all peers. So we should check that, if there have been no block announcements at all within `INACTIVITY_EVICT_THRESHOLD`, we do not evict all peers. I mean, it could happen that we are only connected to peers that have disconnected us. So we should evict only some of the peers, and once the new peers start sending us block announcements, we can also start evicting the old inactive peers.
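One way to read the suggestion above is sketched below. This is only an illustration of the reviewer's idea, not code from the PR: the `select_evictions` function, the `max_when_stalled` parameter, and the "longest idle first" ordering are all assumptions. When at least one peer has been active within the window, every idle peer is evicted; when every peer is idle (which may indicate a stalled chain), only a bounded number are evicted.

```rust
use std::time::Duration;

/// Inactivity window; the PR description uses 2 minutes.
const INACTIVITY_EVICT_THRESHOLD: Duration = Duration::from_secs(2 * 60);

/// Pick peers to evict from `(peer id, idle time)` pairs.
///
/// If every known peer is idle, evict at most `max_when_stalled` peers
/// (a hypothetical knob), preferring the ones that have been idle longest.
fn select_evictions(idle_times: &[(u64, Duration)], max_when_stalled: usize) -> Vec<u64> {
    // Keep only peers that have crossed the inactivity threshold.
    let mut idle: Vec<(u64, Duration)> = idle_times
        .iter()
        .copied()
        .filter(|(_, idle_for)| *idle_for >= INACTIVITY_EVICT_THRESHOLD)
        .collect();

    // Longest idle first; ties are broken by peer id for determinism.
    idle.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));

    // No peer was active in the window: treat this as a possibly stalled chain.
    let all_idle = !idle_times.is_empty() && idle.len() == idle_times.len();
    let limit = if all_idle { max_when_stalled } else { idle.len() };

    idle.into_iter().take(limit).map(|(peer, _)| peer).collect()
}
```

The trade-off discussed in the thread still applies: capping evictions keeps some connections alive during a genuine stall, but it also keeps slots occupied by peers that may have silently dropped the node.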
I'm not sure if I agree. I'd argue that if we are observing a stalled chain, that is because of the peers we're connected to right now, and their inactivity is the direct cause of the stalled chain for us. In this case the best course of action is to evict all of the idle ones. What happens when

Also notice that the process described above is happening constantly during syncing, so idle peers are being swapped out for new potential peers that may be active. This means that when we reach the tip, we would have nodes that have given us new block announcements within the last 2 minutes.

Furthermore, always evicting all idle peers, irrespective of whether the chain is stalled and without a special case for preserving observed idle peers, has another benefit: since we're constantly monitoring the connections and removing idling peers which are then put back into

All that said, there is one downside to this change that I can think of: we're already having issues with too many full nodes on the network. After this change, each node will be "scanning" the network for free full slots, so if previously the node was happy keeping a node in its peer table while it was actually rejected, now the node will evict that node and look for a node that has a free full slot. The good thing is that the chain should be kept up to date, albeit with a delay, because of new blocks discovered in the handshake message. The downside is that if there were previously issues with full node slots being occupied, this change will ensure they will be occupied.
Btw, today I managed to reproduce the sync stalling issue by running a validator long enough. At some point it starts idling and both the best and finalized blocks are stuck. The reason is that it's not receiving any block announcements, as the inbound substream is closed and there's no way to detect that right now. I think it is also linked to the time of day, because when I've done these tests later in the evening/night, everything works fine, probably because there are fewer people running nodes. Applying the inactivity fix and decreasing the threshold to 30 seconds works better than 2 minutes. It kicks out all 40 peers a few times, but then it finds some peers that are able to provide blocks. Interestingly, it looks like providing
Yeah, that is a good point! That should be enough to work even when the chain stalls. It may take a little more time to get everyone syncing again, but that should be negligible.
Thanks for running the burn-in. I'll do one more test in my local environment and I think we can merge this after that.
bot rebase
Rebased
bot merge
* Backport paritytech/substrate#13829 and paritytech/substrate#13941
* Bump node version to v0.1.24-2
* Evict inactive peers from `SyncingEngine`

  If both halves of the block announce notification stream have been inactive for 2 minutes, report the peer and disconnect it, allowing `SyncingEngine` to free up a slot for some other peer that hopefully is more active.

  This needs to be done because the node may falsely believe it has open connections to peers because the inbound substream can be closed without any notification and closed outbound substream is noticed only when node attempts to write to it which may not happen if the node has nothing to send.
* zzz
* wip
* Evict peers only when timeout expires
* Use `debug!()`

---------

Co-authored-by: parity-processbot <>
If both halves of the block announce notification stream have been inactive for 2 minutes, report the peer and disconnect it, allowing `SyncingEngine` to free up a slot for some other peer that hopefully is more active.

This needs to be done because the node may falsely believe it has open connections to peers because the inbound substream can be closed without any notification and closed outbound substream is noticed only when node attempts to write to it which may not happen if the node has nothing to send.
This is a change we probably want to test somewhere before releasing, in case it has adverse effects not observable in my n = 1 dataset.
From logs:
Best block and libp2p peers after 30 minutes of running: