Networking tasks CPU usage is too high #702

Closed
Tracked by #26
sandreim opened this issue Mar 6, 2023 · 14 comments
@sandreim
Contributor

sandreim commented Mar 6, 2023

While running load tests on Versi (with PR paritytech/polkadot#6810) I've observed the networking tasks eating up more than 50% of the CPU, with libp2p tasks being the top consumers.

Substrate task metrics dashboard: https://grafana.parity-mgmt.parity.io/goto/Ajtg2qb4z?orgId=1

This CPU usage is unreasonably high: for comparison, the network-bridge-in-network-worker consumes only 13% while it handles most of the messages as well as decoding them.

@altonen
Contributor

altonen commented Mar 6, 2023

We're working on fixing these issues. One problem is that every event is routed through NetworkWorker, and there is a lot of unnecessary message broadcasting. Work has started on undoing this and reporting events directly to the protocols; once it's finished, the CPU load consumed by the networking subsystem should decrease.
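To make the routing difference concrete, here is a minimal sketch of broadcast-through-one-worker versus direct per-protocol dispatch. The type and field names are hypothetical, not the actual sc-network code:

```rust
use std::collections::HashMap;
use futures::channel::mpsc;

// One example event; the real network event type has more variants.
#[derive(Clone)]
enum Event {
    NotificationReceived { protocol: String, payload: Vec<u8> },
}

// Current model: a single NetworkWorker loop broadcasts every event to
// every registered event stream, even to protocols that will discard it.
struct NetworkWorker {
    event_streams: Vec<mpsc::UnboundedSender<Event>>,
}

impl NetworkWorker {
    fn broadcast(&self, event: Event) {
        for sink in &self.event_streams {
            // Every subscriber pays the cost of every event.
            let _ = sink.unbounded_send(event.clone());
        }
    }
}

// Proposed model: each protocol owns a dedicated channel, so an event is
// delivered exactly once, directly to the protocol it belongs to.
struct DirectDispatch {
    per_protocol: HashMap<String, mpsc::UnboundedSender<Event>>,
}

impl DirectDispatch {
    fn dispatch(&self, event: Event) {
        let Event::NotificationReceived { protocol, .. } = &event;
        if let Some(sink) = self.per_protocol.get(protocol) {
            let _ = sink.unbounded_send(event);
        }
    }
}
```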

@sandreim
Contributor Author

sandreim commented Mar 6, 2023

Can you share the issue tracking this fix? Do you have any numbers on the kind of drop in CPU usage we should expect from it?

@altonen
Contributor

altonen commented Mar 6, 2023

Sorry, I don't have any numbers for you right now. I vaguely remember some previous discussion where NetworkWorker was identified as the chokepoint because of the aforementioned issues. Reducing CPU usage is an emergent property of the proposed design, outlined here, which tries to address another problem (#556), so we don't have an issue tracking the CPU usage specifically right now. Here are two tickets that relate to this issue, though, and they will be updated as we make progress: #554, paritytech/substrate#13531

@sandreim
Contributor Author

sandreim commented Mar 7, 2023

Fixing this issue is very important to us in the context of increasing maxValidators and the number of parachains on Kusama/Polkadot. It is also very important for the Async Backing feature, which we expect to double the network traffic on the validation protocol.

Since we are consistently reproducing this on Versi right now (400 validators, 80 parachains), I think it would be a good idea to investigate. @vstakhov will help with profiling as well as fixes.

@sandreim
Contributor Author

sandreim commented Apr 27, 2023

libp2p discussion: libp2p/rust-libp2p#3840

@sandreim
Contributor Author

sandreim commented Jul 6, 2023

Another thing to dive deeper into is the req/response implementation. The high network CPU load seems correlated with the volume of these requests. At first glance, they are handled differently from gossip messages.

@altonen
Contributor

altonen commented Jul 6, 2023

One thing that's quite suboptimal about the current implementation is that it can only process one request at a time. Once a request is received, the code makes an async call to Peerset to fetch the peer's reputation, stores the received request in self.message_request, and returns Poll::Pending if the reputation hasn't arrived yet. This is not good; it was fixed in a recent PR, but that hasn't been merged yet because there are some issues with testing.
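A minimal sketch of the bottleneck described above. Only message_request is named in the comment; PendingRequest, RequestHandler, and BANNED_THRESHOLD are illustrative placeholders, not the actual sc-network types:

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

type PeerId = [u8; 32];              // placeholder for the real peer id type
const BANNED_THRESHOLD: i32 = -1000; // hypothetical reputation cut-off

struct PendingRequest {
    peer: PeerId,
    payload: Vec<u8>,
}

struct RequestHandler {
    // At most one request can wait here for its Peerset reputation lookup;
    // this single slot is the bottleneck.
    message_request: Option<(PendingRequest, Pin<Box<dyn Future<Output = i32> + Send>>)>,
}

impl RequestHandler {
    fn poll(&mut self, cx: &mut Context<'_>) -> Poll<()> {
        if let Some((_request, reputation_fut)) = self.message_request.as_mut() {
            match reputation_fut.as_mut().poll(cx) {
                // Until Peerset answers, the whole handler yields Pending,
                // so no other incoming request makes progress.
                Poll::Pending => return Poll::Pending,
                Poll::Ready(reputation) if reputation < BANNED_THRESHOLD => {
                    // Reject: the request came from a banned peer.
                    self.message_request = None;
                }
                Poll::Ready(_) => {
                    // Process the request, then free the slot.
                    self.message_request = None;
                }
            }
        }
        Poll::Ready(())
    }
}
```

While the single message_request slot is occupied, every other incoming request stalls behind that one reputation lookup; allowing multiple lookups in flight removes the serialization.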

@sandreim
Contributor Author

sandreim commented Jul 6, 2023

Thanks for the update! CC @alindima can you please test paritytech/substrate#14337 on Versi and see if it improves CPU usage and availability chunk fetching time?

@altonen why do we need to get the reputation of the peer when receiving a request?

@altonen
Contributor

altonen commented Jul 11, 2023

@sandreim So that the request can be rejected if it comes from a banned peer.

@alindima
Contributor

Here are the results of testing paritytech/substrate#14337 on Versi.

Network CPU utilisation is not noticeably different:

[Screenshot: network CPU utilisation, 2023-07-12]

Availability chunk fetch request time is also not noticeably different (the new image was deployed at around 10 AM on the graph):
[Screenshot: availability chunk fetch request time, 2023-07-12]
You can see that the frequency of responses in the 5 ms–10 ms range increases, but the frequency of very slow responses (over 10 s) also increases.

However, the overall 95th percentile looks very similar.

Another interesting finding is that nodes run into some errors and seem to crash sometimes, see: paritytech/substrate#14337 (comment)

There is also a very large increase in libp2p errors for outgoing connections, correlated with the deployment of this test image:

[Screenshot: libp2p outgoing connection errors, 2023-07-12]

You can see two large hills that correspond to the two deployments. In between those two hills, master code was used.

My guess is that those errors also correspond to the increase in requests that take longer than 10s to get a response.

CC: @dmitry-markin

@Polkadot-Forum

This issue has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/altering-polkadots-fork-choice-to-reduce-da-load/3389/13

@sandreim
Contributor Author

We should try using Pyroscope to more easily collect and analyze flamegraphs from Versi during load testing.

@alindima
Contributor

We should try using Pyroscope to more easily collect and analyze flamegraphs from Versi during load testing.

Done. See the comment here: libp2p/rust-libp2p#3840 (reply in thread)
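For reference, a minimal sketch of attaching a Rust binary to a Pyroscope server using the pyroscope and pyroscope_pprofrs crates; the server URL and application name below are placeholders, not the actual Versi setup:

```rust
use pyroscope::PyroscopeAgent;
use pyroscope_pprofrs::{pprof_backend, PprofConfig};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder URL/name -- point these at the real Pyroscope server.
    let agent = PyroscopeAgent::builder("http://pyroscope.example:4040", "versi-validator")
        .backend(pprof_backend(PprofConfig::new().sample_rate(100)))
        .build()?;

    // Start sampling; profiles are pushed to the server continuously.
    let agent = agent.start()?;

    // ... run the workload under load testing ...

    // Stop sampling and flush before exit.
    let agent = agent.stop()?;
    agent.shutdown();
    Ok(())
}
```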

@Sophia-Gold Sophia-Gold transferred this issue from paritytech/polkadot Aug 24, 2023
@the-right-joyce the-right-joyce moved this from Backlog to In Progress in parachains team board Oct 23, 2023
@sandreim
Contributor Author

Closing this as solved by litep2p, but we should reopen if we are not happy with what we get in production.

github-project-automation bot moved this from In Progress to Completed in parachains team board May 21, 2024