Bandwidth usage 10x higher with v9420 #2563
Can you check in Grafana (…)?
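(A sketch of the kind of Prometheus query that would show this in Grafana, assuming the node is scraped with the default `substrate_` metric prefix; the metric name is taken from current Substrate and is not confirmed anywhere in this thread:)

```bash
# Per-direction network bandwidth of the node, averaged over 5 minutes.
# Assumes Prometheus scrapes the node and uses the default "substrate_"
# prefix; adjust if the node was started with a custom registry prefix.
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(substrate_sub_libp2p_network_bytes_total[5m])'
```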
@altonen sure, the results are for all RPC and bootnodes that were updated.
Right after the update the value is somewhere between 6-30.
There is no peak, and some protocols were completely missing with the 9420 release.
I don't see any changes in these graphs. Did you compare the notification data usage between versions, or is it indistinguishable in these graphs you posted? I will see if I can reproduce the data usage issue locally. Are you able to notice this only when you're running …?
@tugytur any update?
I just took a screenshot of the time period before the update was applied and after. I checked our relay chain RPC nodes, which are on v0.9.42; I don't see this behavior on those.
@tugytur …
@altonen sure, these graphs are from the Westend collectives nodes.
[Grafana panel: No Data]
Which state is your node in? Do you see anything interesting in the logs?
The command I used was: … You could enable …
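(Both the command and the suggested log setting were elided above. A sketch of how detailed network logging is typically enabled; the binary, chain, and log targets (`sync`, `sub-libp2p`) are illustrative assumptions based on standard Substrate conventions, not the exact invocation used here:)

```bash
# Run the node with debug-level logging for the syncing and libp2p
# networking subsystems, to see what is generating the traffic.
polkadot-parachain --chain collectives-westend \
  -lsync=debug \
  -lsub-libp2p=debug
```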
Could this have something to do with light clients? If you reduce the number of inbound light peers with …
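(The flag name was cut off; presumably this refers to Substrate's inbound light-peer limit. A sketch, assuming the `--in-peers-light` flag and its default of 100:)

```bash
# Disallow inbound light-client peers entirely to test whether they
# are responsible for the extra traffic (the default limit is 100).
polkadot-parachain --chain collectives-westend --in-peers-light 0
```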
I've tested with the exact same flags as you and still have the issue. The logs don't show anything out of the ordinary. I've also synced from scratch, and the issue is still there. I noticed your 9420 build is different from the one provided by the repo; I'll build the main branch later and try again.
Reducing the number of light peers didn't have any effect.
It is strange that the traffic still looks a bit higher than before. Just making sure, is this still with these flags? @altonen Any ideas what else we could check?
I don't have any new ideas at this point, but it's very strange, that's for sure; maybe the logs will reveal something.
@skunert, I used those flags. @altonen I've used both log flags; you can find the logs here: https://drive.google.com/file/d/177WOWTtP_DFUh9TK219sAJ3o1vg6Kjex/view?usp=sharing For this one I used our Statemine bootnode. The logs were taken with the provided 9420 release from the repo. Flags used were: …
I can't see anything concerning in the logs. The parachain only received ~60 block announcements, each of them 650 bytes (about 39 KB in total), and a handful of block responses, which are likely not the cause. There was recently a report about finality stalling for some people, and the logs were filled with GRANDPA messages, hundreds and hundreds of them. That could possibly explain why the bandwidth usage was high, but in these logs, at least for the parachain, I can't see anything. If you want to run the old problematic version of the node again, this time with …
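(The suggested flag was cut off. A plausible way to capture per-protocol detail, assuming the `sub-libp2p` and `grandpa` trace targets exist as in standard Substrate; the chain name is illustrative:)

```bash
# Trace-level logging of libp2p notifications and GRANDPA gossip, to see
# which protocol is producing the traffic; expect very verbose output.
polkadot-parachain --chain statemine \
  -lsub-libp2p=trace \
  -lgrandpa=trace
```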
@altonen I've tested the latest release; the issue is still there with the flags from above. But I've tested a little more, and I think it's caused by the relay chain instance that is started (despite the --relay-chain-rpc-urls flag). I've added the following additional flags to the node: …
With these flags it went from the average 15-20 MB/s download/upload to ~2 MB/s download and ~5 MB/s upload. That's still quite a lot compared to before, but maybe the issue is there?
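(The exact flags were elided above. In Cumulus-based nodes, arguments after a lone `--` are passed to the embedded relay chain instance, so a hypothetical way to throttle it would look something like this; the specific limits are illustrative, not the ones used in the report:)

```bash
# Limit the embedded relay chain node's peer counts; everything after
# the bare "--" goes to the relay chain part of a Cumulus node.
polkadot-parachain --chain collectives-westend \
  -- --in-peers 5 --out-peers 5
```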
I'll get back to this soon and see if I can reproduce the issue.
Is this correct behavior, @skunert?
Are you able to reproduce this in another environment? This is the only report on bandwidth we've received, and I'm also unable to reproduce it. My total bandwidth is 1-3 Mbps on latest master. Can you graph …?
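(The metric name was cut off. Two things worth graphing here, read straight from the node's own Prometheus endpoint; the metric names are assumptions based on current Substrate, and 9615 is the default metrics port:)

```bash
# Connected peer count and block heights, straight from the node's
# Prometheus endpoint, to compare runs across versions.
curl -s http://localhost:9615/metrics \
  | grep -E 'sub_libp2p_peers_count|block_height'
```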
@altonen Sorry, I missed this comment. The subsystems that are started in the minimal node have network activity, but it should not be significant. I will also start one up again and see if the subsystems send anything unusual.
Checked again, still cannot reproduce. If the relay chain node were the culprit, I would expect to see some messages from the … Some more ideas: …
As @tugytur is running a bootnode, we should look into the stats of one of our bootnodes.
Do you experience higher syncing speeds between the versions? I'm able to reproduce the higher bandwidth usage on my other computer, but while on this computer I get ~70 bps for syncing, on the other computer I get 500+ bps, which does account for a large portion of the used bandwidth. I'm wondering if the other traffic is accounted for by the relay chain RPC connection, but I don't know how to see those numbers, and I couldn't find any Prometheus metrics for it either.
The traffic over the RPC connection is really low. We are listening to import and finality notifications, and that's it. There are some runtime calls we do, but these are mostly for collators. But yeah, I should maybe introduce some metrics for this.
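(In the meantime, the set of network metrics the node already exports can be inspected directly, to check whether anything covers the relay chain RPC connection; this assumes the default Prometheus port 9615:)

```bash
# List every network- or rpc-related metric the node currently exports.
curl -s http://localhost:9615/metrics | grep -iE 'network|rpc' | sort
```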
I tested today by running the relay chain node as well instead of using the --relay-chain-rpc-urls flag; still the same issue. To be sure that the issue is not caused by our own hypervisor/infrastructure, I did some testing today with an EC2 instance on AWS. There were no firewall restrictions, and I removed many of the flags just in case.
…
^This one averaged around 18 Mb/s in each direction.
…
^This one averaged around 13 Mb/s in each direction.
…
^This one is the interesting one. With Westend the average is 2.4 Mb/s in each direction.
…
^Here I re-added some flags and got down to 300 kb/s in each direction.
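(For measuring these averages independently of Grafana, a crude sampler over the kernel's interface counters is enough; `eth0` is an assumption, substitute the actual interface name:)

```bash
#!/usr/bin/env bash
# Print per-second RX/TX throughput for one interface, read from the
# kernel's byte counters in sysfs.
IFACE=${1:-eth0}
while true; do
  r1=$(<"/sys/class/net/$IFACE/statistics/rx_bytes")
  t1=$(<"/sys/class/net/$IFACE/statistics/tx_bytes")
  sleep 1
  r2=$(<"/sys/class/net/$IFACE/statistics/rx_bytes")
  t2=$(<"/sys/class/net/$IFACE/statistics/tx_bytes")
  printf 'rx %6d KB/s  tx %6d KB/s\n' $(( (r2-r1)/1024 )) $(( (t2-t1)/1024 ))
done
```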
Is the number of peers and syncing speed the same between your test runs across different chains? Setting inbound/outbound full/light peers to zero basically prevents the … If the bandwidth is explained by syncing-related block downloads, I think what's interesting is why it is so low on 0.9.40 if the peer counts are the same between the versions.
Bandwidth usage with the release v9420 is more than 10x higher than with v9400.
This was observed on both bootnodes and RPC nodes for the following system chains: …
These are the flags from the collectives-westend RPC node: …
With the flags above, the average for both upload and download was under 1 Mb/s. After the update it went above 10 Mb/s. The decrease at the end was when we downgraded.