Bandwidth usage 10x higher with v9420 #2563
Can you check in Grafana (…)?
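(A sketch of the kind of Prometheus query that would show this in Grafana, assuming the node is scraped with the default `substrate_` metric prefix; the metric name is taken from current Substrate and is not confirmed anywhere in this thread:)

```bash
# Per-direction network bandwidth of the node, averaged over 5 minutes.
# Assumes Prometheus scrapes the node and uses the default "substrate_"
# prefix; adjust if the node was started with a custom registry prefix.
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(substrate_sub_libp2p_network_bytes_total[5m])'
```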
@altonen sure, the results are for all RPC and bootnodes that were updated.
Right after the update the value is somewhere between 6-30.
There is no peak, and some protocols were completely missing with the 9420 release.
I don't see any changes in these graphs. Did you compare the notification data usage between versions, or is it indistinguishable in these graphs you posted? I will see if I can reproduce the data usage issue locally. Are you able to notice this only when you're running …?
@tugytur any update?
I just took a screenshot of the time period before the update was applied and after. I checked our relay chain RPC nodes, which are on v0.9.42; I don't see this behavior on those.
@tugytur …
@altonen sure, these graphs are from the Westend collectives nodes.
[Grafana panel: No Data]
Which state is your node in? Do you see anything interesting in the logs?
The command I used was: … You could enable …
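(Both the command and the suggested log setting were elided above. A sketch of how detailed network logging is typically enabled; the binary, chain, and log targets (`sync`, `sub-libp2p`) are illustrative assumptions based on standard Substrate conventions, not the exact invocation used here:)

```bash
# Run the node with debug-level logging for the syncing and libp2p
# networking subsystems, to see what is generating the traffic.
polkadot-parachain --chain collectives-westend \
  -lsync=debug \
  -lsub-libp2p=debug
```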
Could this have something to do with light clients? If you reduce the number of inbound light peers with …
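(The flag name was cut off; presumably this refers to Substrate's inbound light-peer limit. A sketch, assuming the `--in-peers-light` flag and its default of 100:)

```bash
# Disallow inbound light-client peers entirely to test whether they
# are responsible for the extra traffic (the default limit is 100).
polkadot-parachain --chain collectives-westend --in-peers-light 0
```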
I've tested with the exact same flags as you and still have the issue. The logs don't show anything out of the ordinary. I've also synced from scratch, and the issue is still there. I noticed your 9420 build is different from the one provided by the repo; I'll build the main branch later and try again.
Reducing the number of light peers didn't have any effect.
It is strange that the traffic still looks a bit higher than before. Just making sure, is this still with these flags? @altonen Any ideas what else we could check?
I don't have any new ideas at this point, but it's very strange, that's for sure; maybe the logs will reveal something.
@skunert, I used those flags. @altonen I've used both log flags; you can find the logs here: https://drive.google.com/file/d/177WOWTtP_DFUh9TK219sAJ3o1vg6Kjex/view?usp=sharing For this one I used our Statemine bootnode. The logs were taken with the provided 9420 release from the repo. Flags used were: …
I can't see anything concerning in the logs. The parachain only received ~60 block announcements, each of them 650 bytes (about 39 KB in total), and a handful of block responses, which are likely not the cause. There was recently a report about finality stalling for some people, and the logs were filled with GRANDPA messages, hundreds and hundreds of them. That could possibly explain why the bandwidth usage was high, but in these logs, at least for the parachain, I can't see anything. If you want to run the old problematic version of the node again, this time with …
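(The suggested flag was cut off. A plausible way to capture per-protocol detail, assuming the `sub-libp2p` and `grandpa` trace targets exist as in standard Substrate; the chain name is illustrative:)

```bash
# Trace-level logging of libp2p notifications and GRANDPA gossip, to see
# which protocol is producing the traffic; expect very verbose output.
polkadot-parachain --chain statemine \
  -lsub-libp2p=trace \
  -lgrandpa=trace
```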
@altonen I've tested the latest release; the issue is still there with the flags from above. But I've tested a little more, and I think it's caused by the relay chain instance that is started (despite the --relay-chain-rpc-urls flag). I've added the following additional flags to the node: …
With these flags it went from the average 15-20 MB/s download/upload to ~2 MB/s download and ~5 MB/s upload. That's still quite a lot compared to before, but maybe the issue is there?
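(The exact flags were elided above. In Cumulus-based nodes, arguments after a lone `--` are passed to the embedded relay chain instance, so a hypothetical way to throttle it would look something like this; the specific limits are illustrative, not the ones used in the report:)

```bash
# Limit the embedded relay chain node's peer counts; everything after
# the bare "--" goes to the relay chain part of a Cumulus node.
polkadot-parachain --chain collectives-westend \
  -- --in-peers 5 --out-peers 5
```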
I'll get back to this soon and see if I can reproduce the issue.
Is this correct behavior, @skunert?
Are you able to reproduce this in another environment? This is the only report on bandwidth we've received, and I'm also unable to reproduce it. My total bandwidth is 1-3 Mbps on latest master. Can you graph …?
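(The metric name was cut off. Two things worth graphing here, read straight from the node's own Prometheus endpoint; the metric names are assumptions based on current Substrate, and 9615 is the default metrics port:)

```bash
# Connected peer count and block heights, straight from the node's
# Prometheus endpoint, to compare runs across versions.
curl -s http://localhost:9615/metrics \
  | grep -E 'sub_libp2p_peers_count|block_height'
```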
@altonen Sorry, I missed this comment. The subsystems that are started in the minimal node have network activity, but it should not be significant. I will also start one up again and see if the subsystems send anything unusual.
Checked again, still cannot reproduce. If the relay chain node were the culprit, I would expect to see some messages from the … Some more ideas: …
As @tugytur is running a bootnode, we should look into the stats of one of our bootnodes.
Do you experience higher syncing speeds between the versions? I'm able to reproduce the higher bandwidth usage on my other computer, but while on this computer I get ~70 bps for syncing, on the other computer I get 500+ bps, which does account for a large portion of the used bandwidth. I'm wondering if the other traffic is accounted for by the relay chain RPC connection, but I don't know how to see those numbers, and I couldn't find any Prometheus metrics for it either.
The traffic over the RPC connection is really low. We are listening to import and finality notifications, and that's it. There are some runtime calls we do, but these are mostly for collators. But yeah, I should maybe introduce some metrics for this.
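(In the meantime, the set of network metrics the node already exports can be inspected directly, to check whether anything covers the relay chain RPC connection; this assumes the default Prometheus port 9615:)

```bash
# List every network- or rpc-related metric the node currently exports.
curl -s http://localhost:9615/metrics | grep -iE 'network|rpc' | sort
```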
I tested today by running the relay chain node as well instead of using the --relay-chain-rpc-urls flag; still the same issue. To be sure that the issue is not caused by our own hypervisor/infrastructure, I did some testing today with an EC2 instance on AWS. There were no firewall restrictions, and I removed many of the flags just in case.
…
^This one averaged around 18 Mb/s in each direction.
…
^This one averaged around 13 Mb/s in each direction.
…
^This one is the interesting one. With Westend the average is 2.4 Mb/s in each direction.
…
^Here I re-added some flags and got down to 300 kb/s in each direction.
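(For measuring these averages independently of Grafana, a crude sampler over the kernel's interface counters is enough; `eth0` is an assumption, substitute the actual interface name:)

```bash
#!/usr/bin/env bash
# Print per-second RX/TX throughput for one interface, read from the
# kernel's byte counters in sysfs.
IFACE=${1:-eth0}
while true; do
  r1=$(<"/sys/class/net/$IFACE/statistics/rx_bytes")
  t1=$(<"/sys/class/net/$IFACE/statistics/tx_bytes")
  sleep 1
  r2=$(<"/sys/class/net/$IFACE/statistics/rx_bytes")
  t2=$(<"/sys/class/net/$IFACE/statistics/tx_bytes")
  printf 'rx %6d KB/s  tx %6d KB/s\n' $(( (r2-r1)/1024 )) $(( (t2-t1)/1024 ))
done
```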
Is the number of peers and syncing speed the same between your test runs across different chains? Setting inbound/outbound full/light peers to zero basically prevents the … If the bandwidth is explained by syncing-related block downloads, I think what's interesting is why it is so low on 0.9.40 if the peer counts are the same between the versions.
Bandwidth usage with the release v9420 is more than 10x higher than with v9400.
This was observed on both bootnodes and RPC nodes for the following system chains: …
These are the flags from the collectives-westend RPC node: …
With the flags above, the average for both upload and download was under 1 Mb/s. After the update it went above 10 Mb/s. The decrease at the end was when we downgraded.