Networking tasks CPU usage is too high #702
We're working on fixing these issues. One problem is that every event is routed through …
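The comment trails off above, but the general shape of the problem it describes is funneling every network event, across all protocols, through a single channel and consumer task. Here is a minimal, purely illustrative sketch of that pattern (all names are hypothetical, not the actual Substrate code): the lone consumer's CPU cost grows with total traffic, regardless of how many protocols produce the events.

```rust
use tokio::sync::mpsc;

#[derive(Debug)]
enum NetworkEvent {
    Gossip(Vec<u8>),
    Request(Vec<u8>),
    Dht(Vec<u8>),
}

#[tokio::main]
async fn main() {
    // One channel shared by every protocol.
    let (tx, mut rx) = mpsc::channel::<NetworkEvent>(1024);

    // Many producers (gossip, req/response, DHT, ...) share the sender.
    for i in 0u8..3 {
        let tx = tx.clone();
        tokio::spawn(async move {
            tx.send(NetworkEvent::Gossip(vec![i])).await.ok();
        });
    }
    drop(tx);

    // ...but a single consumer task must touch every event, so its CPU
    // usage scales with the sum of traffic across all protocols.
    while let Some(event) = rx.recv().await {
        match event {
            NetworkEvent::Gossip(d) => println!("gossip event, {} bytes", d.len()),
            NetworkEvent::Request(d) => println!("request event, {} bytes", d.len()),
            NetworkEvent::Dht(d) => println!("dht event, {} bytes", d.len()),
        }
    }
}
```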
Can you share the issue tracking this fix? Do you have numbers on what kind of drop in CPU usage we can expect from the fix?
Sorry, I don't have any numbers for you right now. I vaguely remember some previous discussion about this where …
Fixing this issue is very important to us in the context of increasing the … Since we are consistently reproducing this on Versi right now (400 validators, 80 parachains), I think it would be a good idea to investigate. @vstakhov will help with profiling and with fixes.
libp2p discussion: libp2p/rust-libp2p#3840
Another thing to dive deeper into is the req/response implementation. The high network CPU load seems to be correlated with the volume of these requests, and at first glance they are handled differently from gossip messages.
One thing that's quite suboptimal about the current implementation is that it can only process one request at a time. Once a request is received, the code makes an async call to fetch the peer's reputation before handling it.
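A hedged sketch of the head-of-line blocking being described, together with one possible mitigation. The names here (`reputation_of`, `handle_request`) are illustrative stand-ins, not the real Substrate API: if the request loop awaits the reputation lookup inline, every slow lookup stalls all subsequent requests, whereas spawning a task per request lets the lookups overlap.

```rust
use std::time::Duration;
use tokio::sync::mpsc;

// Stand-in for the async round-trip to the peer store.
async fn reputation_of(_peer: u64) -> i32 {
    tokio::time::sleep(Duration::from_millis(50)).await;
    0
}

async fn handle_request(peer: u64, payload: Vec<u8>) {
    // Reject early if the peer is banned (negative reputation here).
    if reputation_of(peer).await < 0 {
        return;
    }
    println!("served {} bytes for peer {peer}", payload.len());
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<(u64, Vec<u8>)>(64);
    tokio::spawn(async move {
        for peer in 0..4u64 {
            tx.send((peer, vec![0; 128])).await.ok();
        }
    });

    // Sequential version (the reported behaviour): one request at a time,
    // each blocked behind the previous reputation lookup.
    // while let Some((peer, payload)) = rx.recv().await {
    //     handle_request(peer, payload).await;
    // }

    // Concurrent version: spawn per request so lookups overlap.
    while let Some((peer, payload)) = rx.recv().await {
        tokio::spawn(handle_request(peer, payload));
    }

    // Give spawned handlers time to finish in this demo.
    tokio::time::sleep(Duration::from_millis(200)).await;
}
```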
Thanks for the update! CC @alindima: can you please test paritytech/substrate#14337 on Versi and see if it improves CPU usage and av chunk fetching time? @altonen why do we need to get the reputation of the peer when receiving a request?
@sandreim So that the request can be rejected if it comes from a banned peer.
Here are the results of testing paritytech/substrate#14337 on Versi.

Network CPU utilisation is not noticeably different. Availability chunk fetch request time is also not noticeably different (the new image was deployed at around 10 AM on the graph); however, the overall 95th percentile looks very similar.

Another interesting finding is that nodes run into some errors and seem to crash sometimes, see: paritytech/substrate#14337 (comment)

There is also a very large increase in libp2p errors for outgoing connections, which is correlated with the deployment of this test image: you can see two large hills that correspond to the two deployments, with master code running in between them. My guess is that those errors also correspond to the increase in requests that take longer than 10s to get a response.

CC: @dmitry-markin
This issue has been mentioned on Polkadot Forum. There might be relevant details there: https://forum.polkadot.network/t/altering-polkadots-fork-choice-to-reduce-da-load/3389/13 |
We should try using Pyroscope to more easily collect and analyze flamegraphs from Versi during load testing.
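For reference, a minimal sketch of wiring continuous profiling into a Rust binary, assuming the `pyroscope` and `pyroscope_pprofrs` crates and a Pyroscope server reachable at localhost:4040; the server URL, application name, and sample rate are placeholders, not values from this thread.

```rust
use pyroscope::PyroscopeAgent;
use pyroscope_pprofrs::{pprof_backend, PprofConfig};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Attach a sampling profiler backend and ship flamegraph data
    // to the Pyroscope server tagged with an application name.
    let agent = PyroscopeAgent::builder("http://localhost:4040", "versi-validator")
        .backend(pprof_backend(PprofConfig::new().sample_rate(100)))
        .build()?;
    let agent_running = agent.start()?;

    // ... run the node / load test workload here ...

    // Stop profiling cleanly before exit.
    let agent_ready = agent_running.stop()?;
    agent_ready.shutdown();
    Ok(())
}
```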
Done. See the comment here: libp2p/rust-libp2p#3840 (reply in thread)
Closing this as solved by litep2p, but we should reopen if we are not happy with what we get in production.
While running load tests on Versi (with PR paritytech/polkadot#6810) I've observed the networking tasks eat up more than 50% of the CPU, with libp2p tasks being the top consumer. Substrate task metrics dashboard: https://grafana.parity-mgmt.parity.io/goto/Ajtg2qb4z?orgId=1

The CPU usage is unreasonably high, since for example the network-bridge-in-network-worker consumes only 13% and deals with most of the messages as well as decoding them.