Subsystem (av-store) stalled while major syncing #6656
This is a duplicate of: paritytech/substrate#13254 |
This is "interesting", ty for the report. |
Thanks, this is indeed interesting; it looks like the av-store gets stuck for some reason. One possible cause could be the pruning, which is one big storage prefix iteration: https://github.com/paritytech/polkadot/blob/master/node/core/av-store/src/lib.rs#L1219. @dolphintwo can you share a heatmap view for these histogram metrics: |
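For illustration only, a minimal sketch of why a single prefix-sweeping pruning pass can look like a stall. This is not the actual av-store code; the column layout, key encoding, and names below are made up:

```rust
use std::collections::BTreeMap;

// Hypothetical stand-in for an on-disk column: keys are an arbitrary prefix
// followed by a big-endian block number, values are opaque blobs. This is NOT
// the real av-store schema, just enough structure to show the problem.
type Column = BTreeMap<Vec<u8>, Vec<u8>>;

// Hypothetical key layout: the last four bytes encode the block number.
fn block_number_of(key: &[u8]) -> u32 {
    let n = key.len();
    u32::from_be_bytes([key[n - 4], key[n - 3], key[n - 2], key[n - 1]])
}

// Prune everything at or below the finalized block in one pass over the whole
// prefix. If a long major sync has left millions of entries behind, this loop
// runs to completion before the subsystem can service its next message, which
// from the outside looks exactly like a stalled subsystem.
fn prune_on_finality(col: &mut Column, prefix: &[u8], finalized: u32) {
    let mut to_delete = Vec::new();
    for (key, _value) in col.iter() {
        if key.starts_with(prefix) && block_number_of(key) <= finalized {
            to_delete.push(key.clone());
        }
    }
    for key in to_delete {
        col.remove(&key);
    }
}
```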
FWIW I've also seen this recently on some Versi tests. |
@sandreim Is this heatmap okay? My node still keeps restarting. |
Not really. I am using this query for my testnet (you would need to change the labels to match your nodes). It would be good to see the values before the issue started as well. |
The node here is clearly in major sync. In theory it shouldn't receive block import and finality notifications, but it looks like it did. The slowness is either in processing a new leaf (and walking back many blocks to the latest known finalized head) or in pruning on finality. For pruning, we could limit the amount of pruned keys, similar to
|
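A minimal sketch of the "limit the amount of pruned keys" idea from the comment above, again with made-up names and key layout rather than the real av-store schema. The point is that each pass removes at most a fixed number of keys and then yields, leaving the rest for a later pass:

```rust
use std::collections::BTreeMap;

// Hypothetical stand-in for the on-disk column; not the real av-store schema.
type Column = BTreeMap<Vec<u8>, Vec<u8>>;

// Illustrative cap on how many keys a single pass is allowed to remove.
const MAX_PRUNE_PER_PASS: usize = 1_000;

// Hypothetical key layout: the last four bytes encode the block number.
fn block_number_of(key: &[u8]) -> u32 {
    let n = key.len();
    u32::from_be_bytes([key[n - 4], key[n - 3], key[n - 2], key[n - 1]])
}

// Prune at most MAX_PRUNE_PER_PASS keys and report whether work is left, so
// the caller can return to its message loop and finish the job on the next
// finality notification (or on a timer) instead of blocking until done.
fn prune_bounded(col: &mut Column, prefix: &[u8], finalized: u32) -> bool {
    let mut to_delete = Vec::new();
    for (key, _value) in col.iter() {
        if key.starts_with(prefix) && block_number_of(key) <= finalized {
            to_delete.push(key.clone());
            // One extra key is enough to know that more work remains.
            if to_delete.len() > MAX_PRUNE_PER_PASS {
                break;
            }
        }
    }

    let more_left = to_delete.len() > MAX_PRUNE_PER_PASS;
    for key in to_delete.into_iter().take(MAX_PRUNE_PER_PASS) {
        col.remove(&key);
    }
    more_left
}
```

The trade-off is that clearing a large backlog now takes several passes, but the subsystem stays responsive in between.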
Block import notifications are not being sent while doing a major sync. However, finality notifications are sent per justification we import, i.e. once per era. |
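To make the notification pattern above concrete, here is a self-contained sketch. None of this is the actual Substrate/Polkadot API; all types and names are hypothetical. It shows a finality handler that only records the finalized height while the node reports a major sync, deferring the expensive pruning until the sync is over:

```rust
// All types here are hypothetical illustrations, not the Substrate API.

// What a node's sync status check would tell us.
struct SyncStatus {
    major_syncing: bool,
}

// A finality notification; during a major sync these arrive roughly once per
// era (one per imported justification), while block import notifications are
// not sent at all.
struct FinalityNotification {
    finalized_number: u32,
}

// Sketch of a handler that records how far it may prune but defers the
// expensive pruning pass while the node is still major-syncing.
struct PruningState {
    prune_up_to: Option<u32>,
}

impl PruningState {
    fn on_finality(&mut self, note: FinalityNotification, sync: &SyncStatus) {
        // Always remember the highest finalized block seen so far.
        let target = self
            .prune_up_to
            .map_or(note.finalized_number, |n| n.max(note.finalized_number));
        self.prune_up_to = Some(target);

        // While major syncing, just accumulate; do the real work later.
        if sync.major_syncing {
            return;
        }
        if let Some(up_to) = self.prune_up_to.take() {
            prune_up_to(up_to);
        }
    }
}

// Placeholder for the real (ideally bounded) pruning routine.
fn prune_up_to(finalized: u32) {
    println!("pruning everything finalized at or below block {finalized}");
}
```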
Interesting, when I remove "--validator" and restart, I don't seem to encounter this problem anymore. |
@dolphintwo could you reproduce the problem with 0.9.36? I suspect that #6452, which got into the 0.9.37 release, fixed one problem and created another one. #6691 might revert the side effects of that PR. |
grandpa-voter failed. Shutting down service.
I can reproduce this with a normal non-validator archive node of Kusama. |
I have tested it several times and it is true that only version |
@dolphintwo since there are still PRs open related to the above issue, I want to share a piece of log from a parachain client built on top of 0.9.37. This log is similar to #6656 (comment), but the order of subsystems exiting is a bit different. The node was in the middle of a major sync when this happened repeatedly, though a restart helped it proceed:
I noticed a similar error was also reported here. |
@bkchr I was able to reproduce this issue on 0.9.39-1 as well. However, adding the --validator flag with the ksmmc3 database at 15 GB results in a failure with the same error that @aliXsed is experiencing on a parachain (where DOT/KSM is the relay chain)
|
Can you reproduce with 0.9.40? |
Unfortunately, the bug is also occurring on 0.9.40
|
@bkchr The problem is still there on 0.9.41.
Is there a safe value to test for --blocks-pruning, or is it better to leave it only when the node is used as a relay chain? |
Best not to run with blocks pruning. TBH, I will create a PR to disable blocks pruning for now. |
I suggest disabling it only when used in combination with --validator; otherwise, without the --validator flag (for example on a relay chain node), block pruning works fine and helps save disk space. |
Seeing this on polkadot-v0.9.41; this node is running
It happens after a random number of minutes; before that, everything looks good. Update: this node's Polkadot DB was rsynced from another polkadot node (archive & RocksDB); the origin node seems broken too and shows more detail
|
Posting this log as an FYI...
We had a node crash with the following error. The server came back up via systemd, and there have been no issues since.