Evict inactive peers from `SyncingEngine` #13829

altonen · 2023-04-05T14:14:52Z

If both halves of the block announce notification stream have been inactive for 2 minutes, report the peer and disconnect it, allowing `SyncingEngine` to free up a slot for some other peer that hopefully is more active.

This needs to be done because the node may falsely believe it has open connections to peers: the inbound substream can be closed without any notification, and a closed outbound substream is noticed only when the node attempts to write to it, which may not happen if the node has nothing to send.
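The eviction rule described above can be sketched roughly as follows. This is a minimal illustration, not the code from `client/network/sync/src/engine.rs`: `PeerId` as a plain integer, `PeerActivity`, `InactivityTracker` and its methods are hypothetical names, and the 2-minute value for `INACTIVITY_EVICT_THRESHOLD` is taken from the description.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Threshold from the PR description (2 minutes).
const INACTIVITY_EVICT_THRESHOLD: Duration = Duration::from_secs(2 * 60);

/// Hypothetical stand-in for a libp2p peer identifier.
type PeerId = u64;

/// Hypothetical per-peer bookkeeping: last activity seen on each half
/// of the block announce notification stream.
struct PeerActivity {
    last_inbound: Instant,
    last_outbound: Instant,
}

/// Hypothetical tracker; the real `SyncingEngine` keeps more state per peer.
struct InactivityTracker {
    peers: HashMap<PeerId, PeerActivity>,
}

impl InactivityTracker {
    fn new() -> Self {
        Self { peers: HashMap::new() }
    }

    /// Start tracking a peer; both halves count as active at connection time.
    fn on_connected(&mut self, peer: PeerId, now: Instant) {
        self.peers.insert(peer, PeerActivity { last_inbound: now, last_outbound: now });
    }

    /// Record a notification received from the peer.
    fn on_inbound(&mut self, peer: PeerId, now: Instant) {
        if let Some(activity) = self.peers.get_mut(&peer) {
            activity.last_inbound = now;
        }
    }

    /// Record a notification sent to the peer.
    fn on_outbound(&mut self, peer: PeerId, now: Instant) {
        if let Some(activity) = self.peers.get_mut(&peer) {
            activity.last_outbound = now;
        }
    }

    /// Remove and return every peer whose inbound AND outbound halves
    /// have both been idle for longer than the threshold.
    fn evict_inactive(&mut self, now: Instant) -> Vec<PeerId> {
        let mut evicted: Vec<PeerId> = self
            .peers
            .iter()
            .filter(|(_, a)| {
                now.duration_since(a.last_inbound) > INACTIVITY_EVICT_THRESHOLD
                    && now.duration_since(a.last_outbound) > INACTIVITY_EVICT_THRESHOLD
            })
            .map(|(id, _)| *id)
            .collect();
        evicted.sort_unstable();
        for id in &evicted {
            self.peers.remove(id);
        }
        evicted
    }
}
```

In this sketch a peer survives as long as either half has seen traffic within the threshold, which matches the "both halves inactive" wording; reporting the peer and the actual disconnect are left out.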

This is a change we probably want to test somewhere before releasing, in case it has adverse effects not observable in my n = 1 dataset.

From logs:

DEBUG tokio-runtime-worker sync: evict peer 12D3KooWCYdHoP38Jymy3UtVM2mVYcgLNAuMWMNzwsB16kEYQsTA since it has been idling for too long    
DEBUG tokio-runtime-worker sync: evict peer 12D3KooWHsvEicXjWWraktbZ4MQBizuyADQtuEGr3NbDvtm5rFA5 since it has been idling for too long    
DEBUG tokio-runtime-worker sync: 12D3KooWCYdHoP38Jymy3UtVM2mVYcgLNAuMWMNzwsB16kEYQsTA disconnected    
DEBUG tokio-runtime-worker sync: 12D3KooWHsvEicXjWWraktbZ4MQBizuyADQtuEGr3NbDvtm5rFA5 disconnected

Best block and libp2p peers after 30 minutes of running:

client/network/sync/src/engine.rs

bkchr

We need to change the logic a little bit. When the chain is stalled, we should not start evicting all peers. So, we should check that if there have not been any block announcements within `INACTIVITY_EVICT_THRESHOLD`, we do not evict all peers. I mean, it could happen that we are only connected to peers that disconnected us. So, we should only evict some of the peers, and if the new peers start sending us block announcements, we can also start evicting the old inactive peers.
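One way to read this suggestion is sketched below. It is only an illustration of the proposal as worded: the function name `select_evictions` and the `max_when_all_idle` cap are assumptions, since the comment does not say how many peers "some" should be.

```rust
use std::time::Duration;

/// Given how long each peer has been idle, return the indices of the peers
/// to evict. `max_when_all_idle` is a hypothetical cap, not from the PR.
fn select_evictions(
    idle_times: &[Duration],
    threshold: Duration,
    max_when_all_idle: usize,
) -> Vec<usize> {
    // Peers that have been idle for longer than the threshold.
    let idle: Vec<usize> = idle_times
        .iter()
        .enumerate()
        .filter(|(_, idle_for)| **idle_for > threshold)
        .map(|(index, _)| index)
        .collect();

    if idle.len() == idle_times.len() {
        // No peer announced anything within the threshold: the chain may be
        // stalled, so evict only a bounded subset and keep the rest.
        idle.into_iter().take(max_when_all_idle).collect()
    } else {
        // At least one peer is active, so every idle peer can be evicted.
        idle
    }
}
```

Once replacement peers start announcing blocks, the second branch applies and the remaining old idle peers become eligible for eviction as well.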

altonen · 2023-04-06T07:45:41Z

> We need to change the logic a little bit. When the chain is stalled, we should not start evicting all peers. So, we should check that if there are not any block announcements in INACTIVITY_EVICT_THRESHOLD we do not evict all peers. I mean it could happen that we are only connected to peers that disconnected us. So, we should only evict some of the peers and if the new peers start sending us block announcements, we can also start evicting the old inactive peers.

I'm not sure I agree. I'd argue that if we are observing a stalled chain, it is because of the peers we're connected to right now: their inactivity is the direct cause of the stalled chain for us. In this case the best course of action is to evict all of the idle ones. When `SyncingEngine` does that, `Peerset` will notice that the number of actual connections (both inbound and outbound) has fallen below the desired limit and will start establishing/accepting new connections. So the outcome is not that we see the best block stop progressing, kick the inactive peers, see (0 peers) and the best block stays stuck; rather, `SyncingEngine` gets a set of fresh connections, some of which might actually be able to provide us with new block announcements. If even after this process peers are constantly getting swapped out because they're rejecting us, it doesn't really matter anymore: it means that all nodes reachable from this node are full and, no matter what is done, the chain will look as though it's stalled. Keeping some of them as backup doesn't help. The only thing I can think of that helps with that is to increase connectivity with, e.g., hole punching. You could also argue that the chain is kept up to date with some delay since during connection we'd resync from this node using the block advertised in BlockAnnouncesHandshake but that's just an extra feature.

Also notice that the process described above is happening constantly during syncing so idle peers are being swapped out for new potential peers that may be active, meaning this process would ensure that when we reach the tip, we would have nodes that have given us new block announcements within the last 2 minutes.

Furthermore, always evicting all idle peers irrespective of whether the chain is stalled, without a special case for preserving observed idle peers, has another benefit: since we're constantly monitoring the connections and removing idle peers, which are then put back into the `NotConnected` state in `Peerset`, `Peerset` will later attempt to reconnect to that peer, at which point the peer might actually have a slot for us and be able to give us block announcements. In other words, since peer state is dynamic across the entire network, a peer that rejected us now because it's full might have a slot for us when we reconnect later, since its peer state updates independently of ours. If we instead keep it as a backup peer during a stalled chain, we're not able to reconnect and claim this potentially free slot.

All that said, there is one downside to this change that I can think of: we're already having issues with too many full nodes on the network. After this change, each node will be "scanning" the network for free full slots, so if previously the node was happy keeping a node in its peer table while it was actually rejected, now the node will evict that node and look for a node that has a free full slot. The good thing is that the chain should be kept up to date, albeit with a delay, because of new blocks discovered in the handshake message, but the downside is that if there previously were issues with full node slots being occupied, this change will ensure they will be occupied.

altonen · 2023-04-10T14:36:20Z

Btw, today I managed to reproduce the sync stalling issue by running a validator long enough. At some point it starts idling and both the best and finalized blocks are stuck. The reason is that it's not receiving any block announcements, as the inbound substream is closed and there's no way to detect that right now. I think it is also linked to the time of day, because when I've done these tests later in the evening/night, everything works fine, probably because there are fewer people running nodes.

Applying the inactivity fix and decreasing the threshold to 30 seconds works better than 2 minutes. It kicks out all 40 peers a few times but then it finds some peers that are able to provide blocks.

Interestingly, it looks like providing `--in-peers-light 0` fixes the sync issue for nodes both with and without the eviction patch applied, and the node is kept up to date for hours. I will keep testing to see if it produces consistent results or if it was just luck. If it turns out to be a working fix, these sync issues are probably related to paritytech/polkadot-sdk#512, and fixing how handshakes work could be the fix for the issues people have been reporting.

client/network/sync/src/engine.rs

bkchr · 2023-04-13T12:37:10Z

> You could also argue that the chain is kept up to date with some delay since during connection we'd resync from this node using the block advertised in BlockAnnouncesHandshake but that's just an extra feature.

Yeah, that is a good point! That should be enough to work even when the chain stalls. It may take a little more time to get everyone syncing again, but that should be negligible.

vstakhov · 2023-04-19T11:28:54Z

After burning in this PR on Versi, we found that it really helped reduce peer count spikes and the inbound/outbound libp2p error rates:

altonen · 2023-04-19T11:33:48Z

Thanks for running the burn-in. I'll do one more test in my local environment and I think we can merge this after that.

altonen · 2023-04-21T07:48:12Z

bot rebase

paritytech-processbot · 2023-04-21T07:48:20Z

Rebased

altonen · 2023-04-21T08:22:17Z

bot merge

* Evict inactive peers from `SyncingEngine`

  If both halves of the block announce notification stream have been inactive for 2 minutes, report the peer and disconnect it, allowing `SyncingEngine` to free up a slot for some other peer that hopefully is more active.

  This needs to be done because the node may falsely believe it has open connections to peers because the inbound substream can be closed without any notification and closed outbound substream is noticed only when node attempts to write to it which may not happen if the node has nothing to send.

* zzz
* wip
* Evict peers only when timeout expires
* Use `debug!()`

---------

Co-authored-by: parity-processbot <>

altonen requested a review from a team April 5, 2023 14:14

zzz

f0c6c72

dmitry-markin approved these changes Apr 5, 2023

bkchr reviewed Apr 5, 2023

wip

ed79b5b

bkchr approved these changes Apr 13, 2023

altonen added 3 commits April 14, 2023 11:01

Evict peers only when timeout expires

9ddb96b

Merge remote-tracking branch 'origin/master' into sync-evict-idle-peers

a6d794b

Use debug!()

734c4f1

vstakhov added a commit to paritytech/polkadot that referenced this pull request Apr 18, 2023

Burn paritytech/substrate#13829

8541d7c

vstakhov mentioned this pull request Apr 18, 2023

[DNM] Versi burn for https://github.com/paritytech/substrate/pull/13829 paritytech/polkadot#7090

Open

altonen mentioned this pull request Apr 18, 2023

Cumulus sync performance #9360

Open

altonen mentioned this pull request Apr 20, 2023

Kusama nodes stop sync, p2p instability. paritytech/polkadot#6696

Closed

Merge remote-tracking branch 'origin/master' into sync-evict-idle-peers

f2071f3

paritytech-processbot bot merged commit e44b43e into master Apr 21, 2023

paritytech-processbot bot deleted the sync-evict-idle-peers branch April 21, 2023 08:22

jasl added a commit to Phala-Network/khala-parachain that referenced this pull request Apr 24, 2023

Backport paritytech/substrate#13829 and paritytech/substrate#13941

a1ce96a

jasl added a commit to Phala-Network/khala-parachain that referenced this pull request Apr 24, 2023

Backport Substrate #13829 and #13941 (#279)

3c5032b

* Backport paritytech/substrate#13829 and paritytech/substrate#13941 * Bump node version to v0.1.24-2

github-actions bot mentioned this pull request Jun 15, 2023

Update substrate/polkadot/cumulus from v0.9.40 to v0.9.43 moonbeam-foundation/moonbeam#2354

Closed

kacperzuk-neti mentioned this pull request Jun 23, 2023

Polkadot v0.9.43 liberland/liberland_substrate#295

Merged

15 tasks

altonen mentioned this pull request Aug 24, 2023

Finality Lagging issues in Kusama archive & full nodes #13295

Closed

2 tasks

lexnv mentioned this pull request Oct 10, 2024

p2p performance issues, regression? paritytech/polkadot-sdk#6012

Open
