litep2p/kad: Configure periodic network bootstrap for better throughput #4942

Open · wants to merge 21 commits into master
Conversation

@lexnv (Contributor) commented on Jul 4, 2024

This PR modifies the periodic network bootstrap process (submitting kad FIND_NODE queries with random PeerIds) to improve the number of connected peers (a sketch of the new schedule follows the list):

  • Kad queries are initially submitted with an exponentially increasing interval, starting at 5 seconds and converging to 2 minutes
  • If a query fails, the timer is reset to 5 seconds
  • A maximum of 16 kad queries are allowed to exist at any given time (although in practice we have at most 1 query in flight)
  • Queries are not initiated if we are connected to a healthy number of peers (this is similar to libp2p)
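
A minimal sketch of that back-off (hypothetical names, not the actual litep2p code; it assumes a successful query doubles the delay and a failed one resets it):

```rust
use std::time::Duration;

// Hypothetical illustration: the interval between periodic FIND_NODE queries
// grows exponentially from 5 seconds up to a 2-minute cap, and a failed
// query resets it back to 5 seconds.
struct BootstrapBackoff {
    current: Duration,
    initial: Duration,
    max: Duration,
}

impl BootstrapBackoff {
    fn new() -> Self {
        Self {
            current: Duration::from_secs(5),
            initial: Duration::from_secs(5),
            max: Duration::from_secs(2 * 60),
        }
    }

    /// Delay before the next query after a success; doubles, capped at 2 minutes.
    fn on_success(&mut self) -> Duration {
        let delay = self.current;
        self.current = (self.current * 2).min(self.max);
        delay
    }

    /// A failed query resets the timer back to the initial 5 seconds.
    fn on_failure(&mut self) -> Duration {
        self.current = self.initial;
        self.current
    }
}

fn main() {
    let mut backoff = BootstrapBackoff::new();
    // Successive successful queries: 5s, 10s, 20s, 40s, 80s, 120s, 120s, ...
    for _ in 0..7 {
        println!("next query in {:?}", backoff.on_success());
    }
    // A failure resets the interval to 5s.
    println!("after failure: {:?}", backoff.on_failure());
}
```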

The old behavior:

  • only one query in flight
  • query interval: exponential from 5 seconds up to 1 minute (so roughly 1 query per minute at steady state)

For my full node running on Kusama, I observed that queries finish in under 1 minute.
What I expect is happening for a long running node:

  • the node is losing peers (probably due to the peerset banning and disconnecting -- to be investigated / addressed in the future)
  • kad queries can take significantly longer to finish (even over 2 minutes in a toy-app using standalone litep2p)

Testing Done

Started 2 nodes with --in-peers 50 --out-peers 50:

  • the green line represents the number of connected peers with litep2p (and this PR)
  • the yellow line is the equivalent for libp2p (the current backend)

[Screenshot: connected peer count over time, litep2p vs libp2p, 2024-07-03 23:11]

Part of:

cc @paritytech/networking

lexnv added 8 commits on July 4, 2024 at 11:20 (all signed off by Alexandru Vasile <alexandru.vasile@parity.io>)
@lexnv added labels on Jul 4, 2024: A1-insubstantial (pull request requires no code review, e.g., a sub-repository hash update), R0-silent (changes should not be mentioned in any release notes), I9-optimisation (an enhancement to provide better overall performance in terms of time-to-completion for a task), I5-enhancement (an additional feature request), D1-medium (can be fixed by a coder with good Rust knowledge but little knowledge of the codebase).
@lexnv self-assigned this on Jul 4, 2024
lexnv added 4 commits on July 8, 2024 at 14:39 (all signed off by Alexandru Vasile <alexandru.vasile@parity.io>)
/// Number of connected peers over the block announce protocol.
///
/// This is used to update metrics and network status.
num_sync_connected: Arc<AtomicUsize>,
A Member commented:
Looks like a metric that should be moved to the sync crate.

lexnv (Contributor, Author) replied:
Yep that sounds good, have created an issue for this:

Since this was a bit involved for the peerstore metrics :D

substrate/client/network/src/litep2p/mod.rs (review thread outdated, resolved)
lexnv added a commit (signed off by Alexandru Vasile <alexandru.vasile@parity.io>)
@dmitry-markin (Contributor) commented on Jul 16, 2024

Is the algorithm the same as what libp2p is doing? I don't mind merging this, but it's possible the root issue is somewhere else and would be masked once we merge this PR. Also, comparing libp2p and litep2p on a 2h time interval can be influenced by random factors, so it's not necessarily evident that the litep2p approach from this PR is significantly better than what libp2p is doing.

@@ -439,6 +441,9 @@ impl<B: BlockT + 'static, H: ExHashT> NetworkBackend<B, H> for Litep2pNetworkBac
let peer_store_handle = params.network_config.peer_store_handle();
let executor = Arc::new(Litep2pExecutor { executor: params.executor });

let limit_discovery_under =
params.network_config.network_config.default_peers_set.out_peers as usize + 15;
A Contributor commented:

What happens if in_peers is lower than 15?

Do you think the discovery should be guided by the number of sync peers connected, and not by the number of known peers?

Also, I would make sure we do a random walk at least once per minute as it was before, even if we have enough peers connected (make discovery_only_if_under_num slow down queries, but not stop them completely).
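
One way to read that suggestion, as an illustrative sketch only (the function and parameter names are hypothetical, not litep2p's API): once enough peers are connected, discovery switches to a slower fixed interval instead of stopping entirely.

```rust
use std::time::Duration;

// Illustrative only: `discovery_only_if_under_num` is treated here as a
// threshold that slows discovery down rather than suspending it.
fn next_discovery_interval(connected_peers: usize, discovery_only_if_under_num: usize) -> Duration {
    if connected_peers < discovery_only_if_under_num {
        // Under-connected: keep the aggressive back-off schedule.
        Duration::from_secs(5)
    } else {
        // Enough peers: still random-walk at least once per minute.
        Duration::from_secs(60)
    }
}

fn main() {
    assert_eq!(next_discovery_interval(10, 40), Duration::from_secs(5));
    assert_eq!(next_discovery_interval(80, 40), Duration::from_secs(60));
}
```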

@lexnv (Contributor, Author) commented on Jul 16, 2024

Nice catch indeed; we can remove the limit-under config. The extra load on the network should be negligible, and at the same time this removes some extra logic from our code.

Initially, I used sync peers but can't remember why I transitioned to what libp2p is doing.

After I make the adjustment and remove the limit, I think we'll have the benefit of not waiting for slow queries to finish before starting a new one. This should be a bit better than before; we could keep the PR around until we find the root cause (I also think it's safe to merge, but let's be extra sure here) 👍

A few ideas to tackle this next:

  • We could probably report back the total number of connected peers in metrics (not only the sync peers)
  • Litep2p kademlia tables or kademlia queries might have some wrong logic
  • Maybe we could try to evict some peers that are not responding to the sync protocol (could correlate with the metric)

lexnv added further commits (all signed off by Alexandru Vasile <alexandru.vasile@parity.io>). One commit message explains:

This is mainly done to keep a healthy subset of the network in the node's memory and routing table. Otherwise, we may risk trading off discoverability with protocol performance, which is not entirely desirable.
@lexnv lexnv changed the title litep2p/kad: Configure periodic network bootstrap for higher connected peers litep2p/kad: Configure periodic network bootstrap for better throughput Sep 17, 2024
@lexnv (Contributor, Author) commented on Sep 17, 2024

I changed the algorithm for Kademlia discovery on the litep2p side to the following (a sketch follows the list):

  • check every 2 minutes that we have a healthy number of peers (double the number of peers libp2p considers healthy); if we are under this threshold, we perform a kademlia query
  • perform a mandatory kademlia query every 30 minutes
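
A rough sketch of that schedule (hypothetical types and names, not the actual litep2p implementation; it assumes the caller ticks every 2 minutes):

```rust
use std::time::{Duration, Instant};

// Hypothetical illustration of the revised schedule: on every 2-minute tick a
// query is issued if the peer count is below the healthy threshold, and
// unconditionally if 30 minutes have passed since the last query.
struct BootstrapSchedule {
    last_query: Instant,
    mandatory_interval: Duration, // 30 minutes
    healthy_peers: usize,         // e.g. double what libp2p considers healthy
}

impl BootstrapSchedule {
    fn new(healthy_peers: usize) -> Self {
        Self {
            last_query: Instant::now(),
            mandatory_interval: Duration::from_secs(30 * 60),
            healthy_peers,
        }
    }

    /// Called on every 2-minute tick with the current number of connected peers.
    fn should_query(&mut self, connected_peers: usize) -> bool {
        let now = Instant::now();
        let overdue = now.duration_since(self.last_query) >= self.mandatory_interval;
        if overdue || connected_peers < self.healthy_peers {
            self.last_query = now;
            return true;
        }
        false
    }
}

fn main() {
    // The threshold value here is purely illustrative.
    let mut schedule = BootstrapSchedule::new(50);
    // Under the threshold: a query is triggered on this tick.
    assert!(schedule.should_query(10));
    // At or above the threshold (and not yet overdue): no query.
    assert!(!schedule.should_query(60));
}
```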

The main goal was to improve the network throughput as discussed in this forum post.

Will provide more details about performance in a bit; however, the short-term 6-8h data looks extremely promising.

Offhand, it looks like we were performing too many kad queries, leading to an overly dynamic view of the network in terms of the resources consumed to dial, submit queries, handle responses, evict peers from the routing table, etc.
