This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Subsystem (av-store) stalled while major syncing #6656

Closed
mchaffee opened this issue Jan 31, 2023 · 21 comments · Fixed by #6691
Labels
I3-bug Fails to follow expected behavior.

Comments

@mchaffee

Posting this log as an FYI...

We had a node crash with the following error. The server came back up via systemd, and there have been no issues since.

Jan 31 13:27:42 GS-Prod-VA2-VM19 polkadot[646170]: 2023-01-31 13:27:42 ✨ Imported #16434322 (0x38c9…8e04)
Jan 31 13:27:44 GS-Prod-VA2-VM19 polkadot[646170]: 2023-01-31 13:27:44 💤 Idle (40 peers), best: #16434322 (0x38c9…8e04), finalized #16434319 (0xef3e…f2b0),>
Jan 31 13:27:48 GS-Prod-VA2-VM19 polkadot[646170]: 2023-01-31 13:27:48 ✨ Imported #16434323 (0x128b…93d7)
Jan 31 13:27:49 GS-Prod-VA2-VM19 polkadot[646170]: 2023-01-31 13:27:49 💤 Idle (40 peers), best: #16434323 (0x128b…93d7), finalized #16434320 (0x3cc2…dbb9),>
Jan 31 13:27:54 GS-Prod-VA2-VM19 polkadot[646170]: 2023-01-31 13:27:54 💤 Idle (40 peers), best: #16434323 (0x128b…93d7), finalized #16434321 (0x5e4b…323d),>
Jan 31 13:27:55 GS-Prod-VA2-VM19 polkadot[646170]: 2023-01-31 13:27:55 ✨ Imported #16434324 (0xba48…bbca)
Jan 31 13:27:55 GS-Prod-VA2-VM19 polkadot[646170]: 2023-01-31 13:27:55 Advanced session window for approvals update=Advanced { prev_window_start: 27835, pre>
Jan 31 13:27:59 GS-Prod-VA2-VM19 polkadot[646170]: 2023-01-31 13:27:59 💤 Idle (40 peers), best: #16434324 (0xba48…bbca), finalized #16434322 (0x38c9…8e04),>
Jan 31 13:28:00 GS-Prod-VA2-VM19 polkadot[646170]: 2023-01-31 13:28:00 ✨ Imported #16434325 (0xe07b…b367)
Jan 31 13:28:04 GS-Prod-VA2-VM19 polkadot[646170]: 2023-01-31 13:28:04 💤 Idle (40 peers), best: #16434325 (0xe07b…b367), finalized #16434322 (0x38c9…8e04),>
Jan 31 13:28:04 GS-Prod-VA2-VM19 polkadot[646170]: 2023-01-31 13:28:04 ♻️  Reorg on #16434325,0xe07b…b367 to #16434324,0x6e4d…69ec, common ancestor #16434323>
Jan 31 13:28:04 GS-Prod-VA2-VM19 polkadot[646170]: 2023-01-31 13:28:04 ✨ Imported #16434324 (0x6e4d…69ec)
Jan 31 13:28:04 GS-Prod-VA2-VM19 polkadot[646170]: 2023-01-31 13:28:04 GRANDPA voter error: safety invariant has been violated: SelectChain returned a final>
Jan 31 13:28:04 GS-Prod-VA2-VM19 polkadot[646170]: 2023-01-31 13:28:04 Essential task `grandpa-voter` failed. Shutting down service.
Jan 31 13:28:05 GS-Prod-VA2-VM19 polkadot[646170]: Error:
Jan 31 13:28:05 GS-Prod-VA2-VM19 polkadot[646170]:    0: Other: Essential task failed.
Jan 31 13:28:05 GS-Prod-VA2-VM19 polkadot[646170]: Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Jan 31 13:28:05 GS-Prod-VA2-VM19 polkadot[646170]: Run with RUST_BACKTRACE=full to include source snippets.
Jan 31 13:28:05 GS-Prod-VA2-VM19 systemd[1]: polkadot.service: Main process exited, code=exited, status=1/FAILURE
Jan 31 13:28:05 GS-Prod-VA2-VM19 systemd[1]: polkadot.service: Failed with result 'exit-code'.
Jan 31 13:28:05 GS-Prod-VA2-VM19 systemd[1]: polkadot.service: Consumed 2w 2d 22h 47min 18.325s CPU time.
Jan 31 13:29:05 GS-Prod-VA2-VM19 systemd[1]: polkadot.service: Scheduled restart job, restart counter is at 1.
Jan 31 13:29:05 GS-Prod-VA2-VM19 systemd[1]: Stopped Kusama Node.
Jan 31 13:29:05 GS-Prod-VA2-VM19 systemd[1]: polkadot.service: Consumed 2w 2d 22h 47min 18.325s CPU time.
Jan 31 13:29:05 GS-Prod-VA2-VM19 systemd[1]: Started Kusama Node.
Jan 31 13:29:05 GS-Prod-VA2-VM19 polkadot[706190]: 2023-01-31 13:29:05 ----------------------------
Jan 31 13:29:05 GS-Prod-VA2-VM19 polkadot[706190]: 2023-01-31 13:29:05 This chain is not in any way
Jan 31 13:29:05 GS-Prod-VA2-VM19 polkadot[706190]: 2023-01-31 13:29:05       endorsed by the
Jan 31 13:29:05 GS-Prod-VA2-VM19 polkadot[706190]: 2023-01-31 13:29:05      KUSAMA FOUNDATION
Jan 31 13:29:05 GS-Prod-VA2-VM19 polkadot[706190]: 2023-01-31 13:29:05 ----------------------------
Jan 31 13:29:05 GS-Prod-VA2-VM19 polkadot[706190]: 2023-01-31 13:29:05 Parity Polkadot
Jan 31 13:29:05 GS-Prod-VA2-VM19 polkadot[706190]: 2023-01-31 13:29:05 ✌️  version 0.9.37-645723987cf
Jan 31 13:29:05 GS-Prod-VA2-VM19 polkadot[706190]: 2023-01-31 13:29:05 ❤️  by Parity Technologies <admin@parity.io>, 2017-2023
Jan 31 13:29:05 GS-Prod-VA2-VM19 polkadot[706190]: 2023-01-31 13:29:05 📋 Chain specification: Kusama
Jan 31 13:29:05 GS-Prod-VA2-VM19 polkadot[706190]: 2023-01-31 13:29:05 🏷  Node name: 🍀LuckyFriday-KSM-09🍀
Jan 31 13:29:05 GS-Prod-VA2-VM19 polkadot[706190]: 2023-01-31 13:29:05 👤 Role: AUTHORITY
@dolphintwo

2023-02-01 09:12:54 ⚙️  Syncing 20.6 bps, target=#16442460 (100 peers), best: #10908405 (0x663e…8c6d), finalized #10908160 (0x63cc…9b87), ⬇ 1.1MiB/s ⬆ 2.5MiB/s
2023-02-01 09:12:59 ⚙️  Syncing 22.4 bps, target=#16442460 (100 peers), best: #10908517 (0x1444…ccd6), finalized #10908160 (0x63cc…9b87), ⬇ 948.2kiB/s ⬆ 3.8MiB/s
2023-02-01 09:13:04 ⚙️  Syncing 21.9 bps, target=#16442460 (100 peers), best: #10908627 (0x8ac6…84ac), finalized #10908160 (0x63cc…9b87), ⬇ 1.0MiB/s ⬆ 1.6MiB/s
2023-02-01 09:13:09 ⚙️  Syncing 16.4 bps, target=#16442460 (100 peers), best: #10908709 (0x9a8a…9796), finalized #10908672 (0xaa84…5473), ⬇ 1.2MiB/s ⬆ 639.5kiB/s
2023-02-01 09:13:14 ⚙️  Syncing 23.6 bps, target=#16442460 (100 peers), best: #10908827 (0x6319…a354), finalized #10908672 (0xaa84…5473), ⬇ 865.0kiB/s ⬆ 713.3kiB/s
2023-02-01 09:13:15 Overseer exited with error err=Generated(SubsystemStalled("availability-store-subsystem"))
2023-02-01 09:13:15 Essential task `overseer` failed. Shutting down service.
2023-02-01 09:13:15 subsystem exited with error subsystem="statement-distribution-subsystem" err=FromOrigin { origin: "statement-distribution", source: SubsystemReceive(Generated(Context("Signal channel is terminated and empty."))) }
2023-02-01 09:13:15 subsystem exited with error subsystem="chain-api-subsystem" err=FromOrigin { origin: "chain-api", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-02-01 09:13:15 err=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
2023-02-01 09:13:15 subsystem exited with error subsystem="candidate-validation-subsystem" err=FromOrigin { origin: "candidate-validation", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-02-01 09:13:15 subsystem exited with error subsystem="pvf-checker-subsystem" err=FromOrigin { origin: "pvf-checker", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-02-01 09:13:15 subsystem exited with error subsystem="availability-distribution-subsystem" err=FromOrigin { origin: "availability-distribution", source: IncomingMessageChannel(Generated(Context("Signal channel is terminated and empty."))) }
2023-02-01 09:13:15 Unable to author block in slot. No best block header: Request for leaves from chain selection got canceled
2023-02-01 09:13:15 subsystem exited with error subsystem="approval-voting-subsystem" err=FromOrigin { origin: "approval-voting", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-02-01 09:13:15 subsystem exited with error subsystem="dispute-coordinator-subsystem" err=FromOrigin { origin: "dispute-coordinator", source: SubsystemReceive(Generated(Context("Signal channel is terminated and empty."))) }
2023-02-01 09:13:15 subsystem exited with error subsystem="bitfield-signing-subsystem" err=FromOrigin { origin: "bitfield-signing", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-02-01 09:13:15 subsystem exited with error subsystem="runtime-api-subsystem" err=Generated(Context("Signal channel is terminated and empty."))
2023-02-01 09:14:15 Detected running(potentially stalled) tasks on shutdown:
2023-02-01 09:14:15 Task "candidate-backing-subsystem" (Group: candidate-backing) was still running after waiting 60 seconds to finish.
2023-02-01 09:14:15 Task "availability-store-subsystem" (Group: availability-store) was still running after waiting 60 seconds to finish.
Error:
   0: Other: Essential task failed.
Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
kusama.service: Main process exited, code=exited, status=1/FAILURE
kusama.service: Failed with result 'exit-code'.
kusama.service: Scheduled restart job, restart counter is at 13.
Stopped Kusama Node Daemon.
Started Kusama Node Daemon.
2023-02-01 09:14:20 ----------------------------
2023-02-01 09:14:20 This chain is not in any way
2023-02-01 09:14:20       endorsed by the
2023-02-01 09:14:20      KUSAMA FOUNDATION
2023-02-01 09:14:20 ----------------------------
2023-02-01 09:14:20 Parity Polkadot
2023-02-01 09:14:20 ✌️  version 0.9.37-645723987cf
2023-02-01 09:14:20 ❤️  by Parity Technologies <admin@parity.io>, 2017-2023
2023-02-01 09:14:20 📋 Chain specification: Kusama

@bkchr
Member

bkchr commented Feb 6, 2023

Jan 31 13:28:04 GS-Prod-VA2-VM19 polkadot[646170]: 2023-01-31 13:28:04 GRANDPA voter error: safety invariant has been violated: SelectChain returned a final>
Jan 31 13:28:04 GS-Prod-VA2-VM19 polkadot[646170]: 2023-01-31 13:28:04 Essential task grandpa-voter failed. Shutting down service.

This is a duplicate of: paritytech/substrate#13254

@bkchr
Member

bkchr commented Feb 6, 2023

2023-02-01 09:14:15 Detected running(potentially stalled) tasks on shutdown:
2023-02-01 09:14:15 Task "candidate-backing-subsystem" (Group: candidate-backing) was still running after waiting 60 seconds to finish.
2023-02-01 09:14:15 Task "availability-store-subsystem" (Group: availability-store) was still running after waiting 60 seconds to finish.

This is "interesting", ty for the report.

@sandreim
Contributor

sandreim commented Feb 6, 2023

Thanks, this is indeed interesting. It looks like the av-store gets stuck for some reason: 2023-02-01 09:13:15 Overseer exited with error err=Generated(SubsystemStalled("availability-store-subsystem"))

One possible cause could be the pruning, which is one big storage prefix iteration: https://github.com/paritytech/polkadot/blob/master/node/core/av-store/src/lib.rs#L1219.

@dolphintwo can you share a heatmap view for these histogram metrics: polkadot_parachain_av_store_pruning and polkadot_parachain_av_store_process_block_finalized ?
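To illustrate why one big prefix iteration can stall the subsystem: the entire scan-and-delete pass runs synchronously before the event loop gets back to the next overseer signal. The following is a minimal stand-in using a `BTreeMap` rather than the real av-store database column; the function name and constants are illustrative, not the actual av-store code.

```rust
use std::collections::BTreeMap;

// Minimal stand-in for an unbounded prefix prune: every matching key is
// collected and deleted in one synchronous pass. With a very large backlog
// (e.g. after a long offline period), a single call like this can run past
// the overseer's signal timeout and the subsystem is declared stalled.
fn prune_prefix(db: &mut BTreeMap<Vec<u8>, Vec<u8>>, prefix: &[u8]) -> usize {
    let doomed: Vec<Vec<u8>> = db
        .range(prefix.to_vec()..)
        .take_while(|(k, _)| k.starts_with(prefix))
        .map(|(k, _)| k.clone())
        .collect();
    for k in &doomed {
        db.remove(k);
    }
    doomed.len()
}
```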

@sandreim
Contributor

sandreim commented Feb 6, 2023

FWIW, I have also seen this recently in some Versi tests.

@ordian ordian added the I3-bug Fails to follow expected behavior. label Feb 6, 2023
@dolphintwo

@sandreim Is this heatmap okay? My node still keeps restarting.
[screenshot: heatmap]

@sandreim
Contributor

sandreim commented Feb 7, 2023

Not really. I am using this query for my testnet (you would need to change the labels to match your nodes and the polkadot chain): sum(increase(polkadot_parachain_av_store_pruning_bucket{chain="$chain", pod=~".*validator.*"}[$__rate_interval])) by (le) to get this heatmap:

[screenshot: example pruning heatmap]

Would be good to see the values before the issue started as well.

@ordian
Member

ordian commented Feb 7, 2023

2023-02-01 09:12:54 ⚙️  Syncing 20.6 bps, target=#16442460 (100 peers), best: #10908405 (0x663e…8c6d), finalized #10908160 (0x63cc…9b87), ⬇ 1.1MiB/s ⬆ 2.5MiB/s
2023-02-01 09:12:59 ⚙️  Syncing 22.4 bps, target=#16442460 (100 peers), best: #10908517 (0x1444…ccd6), finalized #10908160 (0x63cc…9b87), ⬇ 948.2kiB/s ⬆ 3.8MiB/s
2023-02-01 09:13:04 ⚙️  Syncing 21.9 bps, target=#16442460 (100 peers), best: #10908627 (0x8ac6…84ac), finalized #10908160 (0x63cc…9b87), ⬇ 1.0MiB/s ⬆ 1.6MiB/s
2023-02-01 09:13:09 ⚙️  Syncing 16.4 bps, target=#16442460 (100 peers), best: #10908709 (0x9a8a…9796), finalized #10908672 (0xaa84…5473), ⬇ 1.2MiB/s ⬆ 639.5kiB/s
2023-02-01 09:13:14 ⚙️  Syncing 23.6 bps, target=#16442460 (100 peers), best: #10908827 (0x6319…a354), finalized #10908672 (0xaa84…5473), ⬇ 865.0kiB/s ⬆ 713.3kiB/s
2023-02-01 09:13:15 Overseer exited with error err=Generated(SubsystemStalled("availability-store-subsystem"))

The node here is clearly in major sync. In theory it shouldn't receive block import and finality notifications, but looks like it did.

The slowness is either in processing a new leaf (walking back many blocks to the latest known finalized head) or in pruning on finality. For pruning, we could limit the number of pruned keys, similar to
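One way to bound the pruning work per finality notification can be sketched as follows. This is a hypothetical illustration, not the actual av-store code: `MAX_PRUNE_BATCH` and `prune_bounded` are invented names. The idea is to delete at most N keys per pass and report whether work remains, so the remainder can be rescheduled instead of blocking the event loop.

```rust
const MAX_PRUNE_BATCH: usize = 1_000;

/// Remove at most `MAX_PRUNE_BATCH` expired keys from the pending queue.
/// Returns (number removed, whether more work remains) so the caller can
/// yield back to the event loop and resume pruning on a later tick.
fn prune_bounded(expired: &mut Vec<Vec<u8>>) -> (usize, bool) {
    let n = expired.len().min(MAX_PRUNE_BATCH);
    expired.drain(..n);
    (n, !expired.is_empty())
}
```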

@bkchr
Member

bkchr commented Feb 7, 2023

The node here is clearly in major sync. In theory it shouldn't receive block import and finality notifications, but looks like it did.

Block import notifications are not sent during a major sync. However, a finality notification is sent per justification we import, i.e. once per era.
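To put a rough number on that: assuming Kusama's nominal parameters (6-second block time, 6-hour eras; actual counts vary with missed slots), one justification-driven finality notification during major sync can finalize on the order of thousands of blocks at once, so any per-finality work such as pruning is multiplied accordingly. A back-of-the-envelope check:

```rust
// Rough arithmetic under assumed Kusama parameters (6 s blocks, 6 h eras).
const BLOCK_TIME_SECS: u64 = 6;
const ERA_HOURS: u64 = 6;

// Approximate number of blocks covered by one per-era finality notification.
fn blocks_per_era() -> u64 {
    ERA_HOURS * 3_600 / BLOCK_TIME_SECS
}
```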

@dolphintwo

Interesting, when I remove "--validator" and start again, I don't seem to encounter this problem again.

@ordian
Member

ordian commented Feb 8, 2023

@dolphintwo could you reproduce the problem with 0.9.36? I suspect that #6452 that got into 0.9.37 release fixed one problem and created another one. #6691 might revert the side-effects of that PR.

@ordian ordian changed the title Essential task grandpa-voter failed. Shutting down service. Subsystem (av-store) stalled while major syncing Feb 8, 2023
@krhougs

krhougs commented Feb 9, 2023

Interesting, when I remove "--validator" and start again, I don't seem to encounter this problem again.

I can reproduce this with a normal non-validator archive node of Kusama.

@dolphintwo

@dolphintwo could you reproduce the problem with 0.9.36? I suspect that #6452 that got into 0.9.37 release fixed one problem and created another one. #6691 might revert the side-effects of that PR.

I have tested it several times, and indeed only version 0.9.37 has this problem.

@aliXsed

aliXsed commented Mar 1, 2023

@dolphintwo since there are still PRs open related to the above issue, I want to share a piece of log from a parachain client built on top of 0.9.37. This log is similar to #6656 (comment), but the order in which the subsystems exit is a bit different. The node was in the middle of a major sync when this repeatedly happened, but a restart allowed it to proceed:

2023-02-27 23:57:17 [Parachain] ⚙️  Syncing  0.0 bps, target=#2156686 (15 peers), best: #1557888 (0xda0e…fa46), finalized #0 (0x97da…7a21), ⬇ 109.3kiB/s ⬆ 0.5kiB/s
2023-02-27 23:57:19 [Relaychain] cannot query the runtime API version: UnknownBlock: State already discarded for 0x97eedfa8db50238ab2e113c59a8045f6dc81d9a44473165063741c2244de5a4e
2023-02-27 23:57:22 [Relaychain] ⚙️  Syncing 341.6 bps, target=#14432105 (40 peers), best: #3148687 (0x9d72…aca4), finalized #3148288 (0x0454…8616), ⬇ 344.0kiB/s ⬆ 187.3kiB/s
2023-02-27 23:57:22 [Parachain] ⚙️  Syncing  0.0 bps, target=#2156686 (15 peers), best: #1557888 (0xda0e…fa46), finalized #0 (0x97da…7a21), ⬇ 62.2kiB/s ⬆ 0.5kiB/s
2023-02-27 23:57:24 [Relaychain] cannot query the runtime API version: UnknownBlock: State already discarded for 0x97eedfa8db50238ab2e113c59a8045f6dc81d9a44473165063741c2244de5a4e
2023-02-27 23:57:24 [Relaychain] subsystem exited with error subsystem="runtime-api-subsystem" err=Generated(Context("Signal channel is terminated and empty."))
2023-02-27 23:57:24 [Relaychain] subsystem exited with error subsystem="availability-distribution-subsystem" err=FromOrigin { origin: "availability-distribution", source: IncomingMessageChannel(Generated(Context("Signal channel is terminated and empty."))) }
2023-02-27 23:57:24 [Relaychain] subsystem exited with error subsystem="candidate-validation-subsystem" err=FromOrigin { origin: "candidate-validation", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-02-27 23:57:24 [Relaychain] subsystem exited with error subsystem="bitfield-signing-subsystem" err=FromOrigin { origin: "bitfield-signing", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-02-27 23:57:24 [Relaychain] Failed to receive a message from Overseer, exiting err=Generated(Context("Signal channel is terminated and empty."))
2023-02-27 23:57:24 [Relaychain] subsystem exited with error subsystem="network-bridge-tx-subsystem" err=FromOrigin { origin: "network-bridge", source: SubsystemError(Generated(Context("Signal channel is terminated and empty."))) }
2023-02-27 23:57:24 [Relaychain] Overseer exited with error err=Generated(SubsystemStalled("availability-store-subsystem"))
2023-02-27 23:57:24 [Relaychain] error=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
2023-02-27 23:57:24 [Relaychain] subsystem exited with error subsystem="network-bridge-rx-subsystem" err=FromOrigin { origin: "network-bridge", source: SubsystemError(Generated(Context("Signal channel is terminated and empty."))) }
2023-02-27 23:57:24 [Relaychain] subsystem exited with error subsystem="statement-distribution-subsystem" err=FromOrigin { origin: "statement-distribution", source: SubsystemReceive(Generated(Context("Signal channel is terminated and empty."))) }
2023-02-27 23:57:24 [Relaychain] Essential task `overseer` failed. Shutting down service.
2023-02-27 23:57:24 [Relaychain] subsystem exited with error subsystem="dispute-distribution-subsystem" err=FromOrigin { origin: "dispute-distribution", source: SubsystemReceive(Generated(Context("Signal channel is terminated and empty."))) }
2023-02-27 23:57:24 [Relaychain] subsystem exited with error subsystem="approval-voting-subsystem" err=FromOrigin { origin: "approval-voting", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-02-27 23:57:24 [Relaychain] error receiving message from subsystem context: Generated(Context("Signal channel is terminated and empty.")) err=Generated(Context("Signal channel is terminated and empty."))
2023-02-27 23:57:24 [Relaychain] subsystem exited with error subsystem="availability-recovery-subsystem" err=FromOrigin { origin: "availability-recovery", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-02-27 23:57:24 [Relaychain] subsystem exited with error subsystem="chain-api-subsystem" err=FromOrigin { origin: "chain-api", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-02-27 23:57:24 [Relaychain] err=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
2023-02-27 23:57:24 [Relaychain] error=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
2023-02-22 23:41:52 [Relaychain] Skipping chain: unable to gather info about imported block (0x95fa5e63d56839098dfcf3531f5481396915d7a25850f234890243fa3eacea2a, 5863250): block is from an ancient session
2023-02-22 23:41:52 [Relaychain] error=Runtime(RuntimeRequest(Execution { runtime_api_name: "on_chain_votes", source: Application(Execution(Other("Exported method ParachainHost_on_chain_votes is not found"))) }))
2023-02-22 23:41:52 [Relaychain] Received msg before first active leaves update. This is not expected - message will be dropped. msg=ActiveDisputes(Sender { complete: false })
2023-02-22 23:41:52 [Relaychain] error=Sender(AskActiveDisputesCanceled) ctx="on_message"
2023-02-22 23:41:52 [Relaychain] Failed to get session info. relay_parent=0x7d83a73a65cc7ffcad97251320db60230def1e3aa8aa756f4aae40b94a24cf40

I noticed a similar error was also reported here.

@gregorst3

gregorst3 commented Apr 4, 2023

@bkchr I was able to reproduce this issue on 0.9.39-1 as well.
First, I see there are differences between syncing from scratch and via warp sync.
I did a warp sync using --blocks-pruning=1000 --state-pruning=1000, and the size of ksmcc3 is around 250 GB.
--validator works fine with this.
After that, I did a clean sync with --blocks-pruning=1000 and --state-pruning=1000 without warp sync, and the size of ksmcc3 is only 15 GB.

However, adding the --validator flag with the 15 GB ksmcc3 database results in a failure with the same error @aliXsed is experiencing on a parachain (where DOT/KSM is the relay chain):

Apr 04 14:02:38 Kusama-/ polkadot[437799]: 2023-04-04 14:02:38.204  INFO tokio-runtime-worker substrate: 💤 Idle (40 peers), best: #17336026 (0x65d3…377a), finalized #17336023 (0xe8d1…2e96), ⬇ 826.9kiB/s ⬆ 638.2kiB/s
Apr 04 14:02:42 Kusama-/ polkadot[437799]: 2023-04-04 14:02:42.815  INFO tokio-runtime-worker substrate: ✨ Imported #17336027 (0x169e…964d)
Apr 04 14:02:43 Kusama-/ polkadot[437799]: 2023-04-04 14:02:43.205  INFO tokio-runtime-worker substrate: 💤 Idle (31 peers), best: #17336027 (0x169e…964d), finalized #17336024 (0x24c7…fdff), ⬇ 1.0MiB/s ⬆ 733.8kiB/s
Apr 04 14:02:43 Kusama-/ polkadot[437799]: 2023-04-04 14:02:43.207  INFO tokio-runtime-worker substrate: ✨ Imported #17336027 (0x8488…8a27)
Apr 04 14:02:48 Kusama-/ polkadot[437799]: 2023-04-04 14:02:48.205  INFO tokio-runtime-worker substrate: 💤 Idle (37 peers), best: #17336027 (0x169e…964d), finalized #17336024 (0x24c7…fdff), ⬇ 731.9kiB/s ⬆ 452.0kiB/s
Apr 04 14:02:48 Kusama-/ polkadot[437799]: 2023-04-04 14:02:48.858  INFO tokio-runtime-worker substrate: ✨ Imported #17336028 (0xec30…c237)
Apr 04 14:02:50 Kusama-/ polkadot[437799]: 2023-04-04 14:02:50.412  WARN tokio-runtime-worker slots: Unable to author block in slot. No best block header: Request for leaves from chain selection got canceled
Apr 04 14:02:50 Kusama-/ polkadot[437799]: 2023-04-04 14:02:50.412  INFO tokio-runtime-worker overseer: Failed to send an event to Overseer
Apr 04 14:02:50 Kusama-/ polkadot[437799]: 2023-04-04 14:02:50.412  WARN tokio-runtime-worker slots: Unable to author block in slot. No best block header: Request for leaves from chain selection got canceled
Apr 04 14:02:50 Kusama-/ polkadot[437799]: 2023-04-04 14:02:50.412  WARN tokio-runtime-worker parachain::chain-selection: err=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
Apr 04 14:02:50 Kusama-/ polkadot[437799]: 2023-04-04 14:02:50.412 ERROR tokio-runtime-worker overseer: Overseer exited with error err=Generated(SubsystemStalled("availability-store-subsystem"))
Apr 04 14:02:50 Kusama-/ polkadot[437799]: 2023-04-04 14:02:50.412 ERROR tokio-runtime-worker sc_service::task_manager: Essential task `overseer` failed. Shutting down service.
Apr 04 14:02:50 Kusama-/ polkadot[437799]: 2023-04-04 14:02:50.412 ERROR tokio-runtime-worker polkadot_overseer: subsystem exited with error subsystem="approval-voting-subsystem" err=FromOrigin { origin: "approval-voting", source: Generated(Context("Signal channel is terminated and empty.")) }
Apr 04 14:02:50 Kusama-/ polkadot[437799]: 2023-04-04 14:02:50.412 ERROR tokio-runtime-worker polkadot_overseer: subsystem exited with error subsystem="chain-api-subsystem" err=FromOrigin { origin: "chain-api", source: Generated(Context("Signal channel is terminated and empty.")) }
Apr 04 14:02:50 Kusama-/ polkadot[437799]: 2023-04-04 14:02:50.412 ERROR tokio-runtime-worker polkadot_overseer: subsystem exited with error subsystem="candidate-validation-subsystem" err=FromOrigin { origin: "candidate-validation", source: Generated(Context("Signal channel is terminated and empty.")) }
Apr 04 14:02:50 Kusama-/ polkadot[437799]: 2023-04-04 14:02:50.414 ERROR tokio-runtime-worker polkadot_overseer: subsystem exited with error subsystem="runtime-api-subsystem" err=Generated(Context("Signal channel is terminated and empty."))
Apr 04 14:02:50 Kusama-/ polkadot[437799]: 2023-04-04 14:02:50.412 ERROR tokio-runtime-worker polkadot_overseer: subsystem exited with error subsystem="bitfield-signing-subsystem" err=FromOrigin { origin: "bitfield-signing", source: Generated(Context("Signal channel is terminated and empty.")) }
Apr 04 14:03:50 Kusama-/ polkadot[437799]: 2023-04-04 14:03:50.418 ERROR                 main sc_cli::runner: Detected running(potentially stalled) tasks on shutdown:
Apr 04 14:03:50 Kusama-/ polkadot[437799]: 2023-04-04 14:03:50.419 ERROR                 main sc_cli::runner: Task "availability-store-subsystem" (Group: availability-store) was still running after waiting 60 seconds to finish.
Apr 04 14:03:50 Kusama-/ polkadot[437799]: Error:
Apr 04 14:03:50 Kusama-/ polkadot[437799]:    0: Other: Essential task failed.
Apr 04 14:03:50 Kusama-/ polkadot[437799]: Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Apr 04 14:03:50 Kusama-/ polkadot[437799]: Run with RUST_BACKTRACE=full to include source snippets.
Apr 04 14:03:50 Kusama-/ systemd[1]: polkadot-validator.service: Main process exited, code=exited, status=1/FAILURE
Apr 04 14:03:50 Kusama-/ systemd[1]: polkadot-validator.service: Failed with result 'exit-code'.

@bkchr
Member

bkchr commented Apr 4, 2023

Can you reproduce with 0.9.40?

@gregorst3

Can you reproduce with 0.9.40?

Unfortunately, the bug also occurs on 0.9.40.
Could this be caused by the database having been synced (from scratch) using 0.9.37?

Apr 04 20:17:57 Kusama-/ polkadot[442350]: 2023-04-04 20:17:57.448  INFO tokio-runtime-worker substrate: 💤 Idle (422 peers), best: #17339762 (0x4403…8fae), finalized #17339759 (0xb67a…b31d), ⬇ 884.7kiB/s ⬆ 419.6kiB/s
Apr 04 20:18:00 Kusama-/ polkadot[442350]: 2023-04-04 20:18:00.705  INFO tokio-runtime-worker substrate: ✨ Imported #17339763 (0xc38f…ecc7)
Apr 04 20:18:00 Kusama-/ polkadot[442350]: 2023-04-04 20:18:00.725  INFO tokio-runtime-worker parachain::dispute-coordinator: New dispute initiated for candidate. candidate_hash=0x25111f19332e4813a1fe01b6b5c3375bb4acfa225558c2a34f05464a8cf6ca06 session=29354 traceID=49270335641878441886656006501731350363
Apr 04 20:18:00 Kusama-/ polkadot[442350]: 2023-04-04 20:18:00.725  INFO tokio-runtime-worker parachain::dispute-coordinator: Dispute on candidate concluded with 'valid' result candidate_hash=0x25111f19332e4813a1fe01b6b5c3375bb4acfa225558c2a34f05464a8cf6ca06 session=29354 traceID=49270335641878441886656006501731350363
Apr 04 20:18:01 Kusama-/ polkadot[442350]: 2023-04-04 20:18:00.986  INFO tokio-runtime-worker substrate: ✨ Imported #17339763 (0x126b…582f)
Apr 04 20:18:02 Kusama-/ polkadot[442350]: 2023-04-04 20:18:02.448  INFO tokio-runtime-worker substrate: 💤 Idle (421 peers), best: #17339763 (0xc38f…ecc7), finalized #17339760 (0xd58e…6d4c), ⬇ 626.2kiB/s ⬆ 831.8kiB/s
Apr 04 20:18:06 Kusama-/ polkadot[442350]: 2023-04-04 20:18:06.911  INFO tokio-runtime-worker substrate: ✨ Imported #17339764 (0x5a45…0152)
Apr 04 20:18:07 Kusama-/ polkadot[442350]: 2023-04-04 20:18:07.455  INFO tokio-runtime-worker substrate: 💤 Idle (415 peers), best: #17339764 (0x5a45…0152), finalized #17339761 (0x5f20…7521), ⬇ 991.6kiB/s ⬆ 500.6kiB/s
Apr 04 20:18:12 Kusama-/ polkadot[442350]: 2023-04-04 20:18:12.455  INFO tokio-runtime-worker substrate: 💤 Idle (416 peers), best: #17339764 (0x5a45…0152), finalized #17339762 (0x4403…8fae), ⬇ 695.0kiB/s ⬆ 447.0kiB/s
Apr 04 20:18:12 Kusama-/ polkadot[442350]: 2023-04-04 20:18:12.746  INFO tokio-runtime-worker substrate: ✨ Imported #17339765 (0x4540…99ef)
Apr 04 20:18:17 Kusama-/ polkadot[442350]: 2023-04-04 20:18:17.455  INFO tokio-runtime-worker substrate: 💤 Idle (418 peers), best: #17339765 (0x4540…99ef), finalized #17339763 (0xc38f…ecc7), ⬇ 953.0kiB/s ⬆ 690.4kiB/s
Apr 04 20:18:18 Kusama-/ polkadot[442350]: 2023-04-04 20:18:18.581  INFO tokio-runtime-worker substrate: ✨ Imported #17339766 (0xf6fd…3035)
Apr 04 20:18:22 Kusama-/ polkadot[442350]: 2023-04-04 20:18:22.455  INFO tokio-runtime-worker substrate: 💤 Idle (418 peers), best: #17339766 (0xf6fd…3035), finalized #17339763 (0xc38f…ecc7), ⬇ 447.4kiB/s ⬆ 366.3kiB/s
Apr 04 20:18:24 Kusama-/ polkadot[442350]: 2023-04-04 20:18:24.801  INFO tokio-runtime-worker substrate: ✨ Imported #17339767 (0x5fc6…34b9)
Apr 04 20:18:27 Kusama-/ polkadot[442350]: 2023-04-04 20:18:27.456  INFO tokio-runtime-worker substrate: 💤 Idle (418 peers), best: #17339767 (0x5fc6…34b9), finalized #17339765 (0x4540…99ef), ⬇ 929.2kiB/s ⬆ 692.4kiB/s
Apr 04 20:18:30 Kusama-/ polkadot[442350]: 2023-04-04 20:18:30.679  INFO tokio-runtime-worker substrate: ✨ Imported #17339768 (0x52f2…e3fe)
Apr 04 20:18:30 Kusama-/ polkadot[442350]: 2023-04-04 20:18:30.762  INFO tokio-runtime-worker sc_informant: ♻️  Reorg on #17339768,0x52f2…e3fe to #17339768,0x1211…ef84, common ancestor #17339767,0x5fc6…34b9
Apr 04 20:18:30 Kusama-/ polkadot[442350]: 2023-04-04 20:18:30.762  INFO tokio-runtime-worker substrate: ✨ Imported #17339768 (0x1211…ef84)
Apr 04 20:18:32 Kusama-/ polkadot[442350]: 2023-04-04 20:18:32.457  INFO tokio-runtime-worker substrate: 💤 Idle (419 peers), best: #17339768 (0x1211…ef84), finalized #17339765 (0x4540…99ef), ⬇ 1.0MiB/s ⬆ 619.0kiB/s
Apr 04 20:18:36 Kusama-/ polkadot[442350]: 2023-04-04 20:18:36.655  INFO tokio-runtime-worker substrate: ✨ Imported #17339769 (0x4788…f930)
Apr 04 20:18:37 Kusama-/ polkadot[442350]: 2023-04-04 20:18:37.310  WARN tokio-runtime-worker parachain::chain-selection: err=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
Apr 04 20:18:37 Kusama-/ polkadot[442350]: 2023-04-04 20:18:37.311 ERROR tokio-runtime-worker overseer: Overseer exited with error err=Generated(SubsystemStalled("availability-store-subsystem", "signal", "polkadot_node_subsystem_types::OverseerSignal"))
Apr 04 20:18:37 Kusama-/ polkadot[442350]: 2023-04-04 20:18:37.310 ERROR tokio-runtime-worker polkadot_overseer: subsystem exited with error subsystem="bitfield-signing-subsystem" err=FromOrigin { origin: "bitfield-signing", source: Generated(Context("Signal channel is terminated and empty.")) }
Apr 04 20:18:37 Kusama-/ polkadot[442350]: 2023-04-04 20:18:37.311 ERROR tokio-runtime-worker sc_service::task_manager: Essential task `overseer` failed. Shutting down service.
Apr 04 20:18:37 Kusama-/ polkadot[442350]: 2023-04-04 20:18:37.319 ERROR tokio-runtime-worker polkadot_overseer: subsystem exited with error subsystem="chain-api-subsystem" err=FromOrigin { origin: "chain-api", source: Generated(Context("Signal channel is terminated and empty.")) }
Apr 04 20:18:37 Kusama-/ polkadot[442350]: 2023-04-04 20:18:37.319 ERROR tokio-runtime-worker polkadot_overseer: subsystem exited with error subsystem="approval-voting-subsystem" err=FromOrigin { origin: "approval-voting", source: Generated(Context("Signal channel is terminated and empty.")) }
Apr 04 20:18:37 Kusama-/ polkadot[442350]: 2023-04-04 20:18:37.320 ERROR tokio-runtime-worker polkadot_overseer: subsystem exited with error subsystem="dispute-coordinator-subsystem" err=FromOrigin { origin: "dispute-coordinator", source: SubsystemReceive(Generated(Context("Signal channel is terminated and empty."))) }
Apr 04 20:18:37 Kusama-/ polkadot[442350]: 2023-04-04 20:18:37.321 ERROR tokio-runtime-worker polkadot_overseer: subsystem exited with error subsystem="runtime-api-subsystem" err=Generated(Context("Signal channel is terminated and empty."))
Apr 04 20:18:37 Kusama-/ polkadot[442350]: 2023-04-04 20:18:37.327 ERROR tokio-runtime-worker polkadot_overseer: subsystem exited with error subsystem="candidate-validation-subsystem" err=FromOrigin { origin: "candidate-validation", source: Generated(Context("Signal channel is terminated and empty.")) }
Apr 04 20:18:37 Kusama-/ polkadot[442350]: 2023-04-04 20:18:37.330 ERROR tokio-runtime-worker polkadot_overseer: subsystem exited with error subsystem="pvf-checker-subsystem" err=FromOrigin { origin: "pvf-checker", source: Generated(Context("Signal channel is terminated and empty.")) }
Apr 04 20:19:37 Kusama-/ polkadot[442350]: 2023-04-04 20:19:37.312 ERROR                 main sc_cli::runner: Detected running(potentially stalled) tasks on shutdown:
Apr 04 20:19:37 Kusama-/ polkadot[442350]: 2023-04-04 20:19:37.312 ERROR                 main sc_cli::runner: Task "availability-store-subsystem" (Group: availability-store) was still running after waiting 60 seconds to finish.
Apr 04 20:19:37 Kusama-/ polkadot[442350]: Error:
Apr 04 20:19:37 Kusama-/ polkadot[442350]:    0: Other: Essential task failed.
Apr 04 20:19:37 Kusama-/ polkadot[442350]: Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Apr 04 20:19:37 Kusama-/ polkadot[442350]: Run with RUST_BACKTRACE=full to include source snippets.
Apr 04 20:19:37 Kusama-/ systemd[1]: polkadot-validator.service: Main process exited, code=exited, status=1/FAILURE
Apr 04 20:19:37 Kusama-/ systemd[1]: polkadot-validator.service: Failed with result 'exit-code'.
Apr 04 20:19:37 Kusama-/ systemd[1]: polkadot-validator.service: Consumed 7min 22.210s CPU time.

@gregorst3

@bkchr The problem is still there on 0.9.41.
It seems that warp sync bypasses the --blocks-pruning=1000 flag and performs a full archive sync of canonical blocks.
When syncing from scratch with --blocks-pruning=1000 and --state-pruning=1000, the total KSM DB is only 15 GB.
However, when that database is then used with the --validator flag, the node fails and crashes.
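For reference, a command-line sketch of the setup described above — a sketch only: the flag names match recent polkadot releases, while the base path and chain values are illustrative assumptions:

```shell
# Sketch, not an exact reproduction: base path is illustrative.
# 1) Sync from scratch with pruning enabled (yields the ~15 GB KSM DB):
polkadot \
  --chain kusama \
  --base-path /data/kusama \
  --state-pruning 1000 \
  --blocks-pruning 1000

# 2) Restarting on the same database as a validator is what triggers the crash:
polkadot \
  --chain kusama \
  --base-path /data/kusama \
  --state-pruning 1000 \
  --blocks-pruning 1000 \
  --validator
```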

Apr 06 22:45:33 Kusama-/ polkadot[465845]: 2023-04-06 22:45:33.686  INFO tokio-runtime-worker substrate: ⚙️  Preparing 19.6 bps, target=#17369567 (40 peers), best: #17367650 (0x9e56…a599), finalized #17367552 (0x06d1…9506), ⬇ 959.6kiB/s ⬆ 223.1kiB/s
Apr 06 22:45:38 Kusama-/ polkadot[465845]: 2023-04-06 22:45:38.688  INFO tokio-runtime-worker substrate: ⚙️  Preparing 19.9 bps, target=#17369567 (40 peers), best: #17367750 (0xf2a6…da1b), finalized #17367552 (0x06d1…9506), ⬇ 558.8kiB/s ⬆ 177.0kiB/s
Apr 06 22:45:38 Kusama-/ polkadot[465845]: 2023-04-06 22:45:38.698 ERROR tokio-runtime-worker polkadot_overseer: subsystem exited with error subsystem="network-bridge-rx-subsystem" err=FromOrigin { origin: "network-bridge", source: SubsystemError(Generated(Context("Signal channel is terminated and empty."))) }
Apr 06 22:45:38 Kusama-/ polkadot[465845]: 2023-04-06 22:45:38.698  WARN tokio-runtime-worker slots: Unable to author block in slot. No best block header: Request for leaves from chain selection got canceled
Apr 06 22:45:38 Kusama-/ polkadot[465845]: 2023-04-06 22:45:38.699  INFO tokio-runtime-worker overseer: Failed to send an event to Overseer
Apr 06 22:45:38 Kusama-/ polkadot[465845]: 2023-04-06 22:45:38.699  WARN tokio-runtime-worker slots: Unable to author block in slot. No best block header: Request for leaves from chain selection got canceled
Apr 06 22:45:38 Kusama-/ polkadot[465845]: 2023-04-06 22:45:38.698  WARN tokio-runtime-worker parachain::chain-selection: err=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
Apr 06 22:45:38 Kusama-/ polkadot[465845]: 2023-04-06 22:45:38.700 ERROR tokio-runtime-worker polkadot_overseer: subsystem exited with error subsystem="runtime-api-subsystem" err=Generated(Context("Signal channel is terminated and empty."))

Is there a safe value to test for --blocks-pruning, or is it better to use it only when the node serves as the relay chain of a parachain?
(If I use the same 15 GB DB as the relay chain of a parachain node, no problems occur, because then you are a validator on the para and not on the relay chain.)

@bkchr
Member

bkchr commented Apr 7, 2023

Best not to run with blocks pruning. TBH, I will create a PR to disable blocks pruning for now.

@gregorst3

gregorst3 commented Apr 7, 2023

Best to not run with blocks pruning. TBH, I will create a pr to disable blocks pruning for now.

I suggest disabling it only when used in combination with --validator; otherwise, without the --validator flag (for example on a relay chain), blocks pruning works fine and helps save disk space.
I've been testing it for months on KSM used as a relay chain and never had problems; failures occur only when running directly on KSM/DOT (i.e. not as the relay chain of a parachain node) in combination with --validator.

@jasl
Contributor

jasl commented Apr 21, 2023

Seeing this on polkadot-v0.9.41; the node is running with state-pruning archive and blocks-pruning archive on RocksDB:

2023-04-21 17:36:49 [Relaychain] ✨ Imported #15190522 (0x4e6f…7d7a)
2023-04-21 17:36:49 [Relaychain] subsystem exited with error subsystem="candidate-validation-subsystem" err=FromOrigin { origin: "candidate-validation", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-04-21 17:36:49 [Relaychain] subsystem exited with error subsystem="bitfield-signing-subsystem" err=FromOrigin { origin: "bitfield-signing", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-04-21 17:36:49 [Relaychain] subsystem exited with error subsystem="availability-distribution-subsystem" err=FromOrigin { origin: "availability-distribution", source: IncomingMessageChannel(Generated(Context("Signal channel is terminated and empty."))) }
2023-04-21 17:36:49 [Relaychain] subsystem exited with error subsystem="statement-distribution-subsystem" err=FromOrigin { origin: "statement-distribution", source: SubsystemReceive(Generated(Context("Signal channel is terminated and empty."))) }
2023-04-21 17:36:49 [Relaychain] Failed to receive a message from Overseer, exiting err=Generated(Context("Signal channel is terminated and empty."))
2023-04-21 17:36:49 [Relaychain] subsystem exited with error subsystem="availability-recovery-subsystem" err=FromOrigin { origin: "availability-recovery", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-04-21 17:36:49 [Relaychain] error receiving message from subsystem context: Generated(Context("Signal channel is terminated and empty.")) err=Generated(Context("Signal channel is terminated and empty."))
2023-04-21 17:36:49 [Relaychain] subsystem exited with error subsystem="network-bridge-tx-subsystem" err=FromOrigin { origin: "network-bridge", source: SubsystemError(Generated(Context("Signal channel is terminated and empty."))) }
2023-04-21 17:36:49 [Relaychain] subsystem exited with error subsystem="approval-voting-subsystem" err=FromOrigin { origin: "approval-voting", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-04-21 17:36:49 [Relaychain] err=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
2023-04-21 17:36:49 [Relaychain] Overseer exited with error err=Generated(SubsystemStalled("availability-store-subsystem"))
2023-04-21 17:36:49 [Relaychain] Essential task `overseer` failed. Shutting down service.
2023-04-21 17:36:49 [Relaychain] subsystem exited with error subsystem="dispute-coordinator-subsystem" err=FromOrigin { origin: "dispute-coordinator", source: SubsystemReceive(Generated(Context("Signal channel is terminated and empty."))) }
2023-04-21 17:36:49 [Relaychain] subsystem exited with error subsystem="network-bridge-rx-subsystem" err=FromOrigin { origin: "network-bridge", source: SubsystemError(Generated(Context("Signal channel is terminated and empty."))) }
2023-04-21 17:36:49 [Relaychain] subsystem exited with error subsystem="runtime-api-subsystem" err=Generated(Context("Signal channel is terminated and empty."))

The error appears after a random number of minutes; before that, everything looks good.
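For context, since the log lines above carry a [Relaychain] prefix, this appears to be a cumulus-based parachain node with an embedded relay-chain node. As a sketch — the binary name, chain spec, and paths are illustrative assumptions, not taken from the report — the archive configuration would be passed after the `--` separator that introduces the relay-chain arguments:

```shell
# Sketch under assumptions: binary name, chain spec, and paths are illustrative.
# Arguments before `--` configure the parachain side; arguments after `--`
# configure the embedded relay-chain node.
./parachain-node \
  --chain my-parachain.json \
  --base-path /data/node \
  -- \
  --chain polkadot \
  --database rocksdb \
  --state-pruning archive \
  --blocks-pruning archive
```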

Update: this node's Polkadot DB was rsynced from another polkadot node (archive & RocksDB). The origin node seems broken too, and shows more detail:

2023-04-21 18:10:10 💤 Idle (40 peers), best: #15190853 (0xc376…9b7b), finalized #15190851 (0x6a8e…68a5), ⬇ 149.0kiB/s ⬆ 148.6kiB/s
2023-04-21 18:10:15 subsystem exited with error subsystem="bitfield-signing-subsystem" err=FromOrigin { origin: "bitfield-signing", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-04-21 18:10:15 subsystem exited with error subsystem="candidate-validation-subsystem" err=FromOrigin { origin: "candidate-validation", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-04-21 18:10:15 subsystem exited with error subsystem="approval-voting-subsystem" err=FromOrigin { origin: "approval-voting", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-04-21 18:10:15 Overseer exited with error err=Generated(SubsystemStalled("availability-store-subsystem"))
2023-04-21 18:10:15 Essential task `overseer` failed. Shutting down service.
2023-04-21 18:10:15 subsystem exited with error subsystem="pvf-checker-subsystem" err=FromOrigin { origin: "pvf-checker", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-04-21 18:10:15 subsystem exited with error subsystem="chain-api-subsystem" err=FromOrigin { origin: "chain-api", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-04-21 18:10:15 subsystem exited with error subsystem="network-bridge-rx-subsystem" err=FromOrigin { origin: "network-bridge", source: SubsystemError(Generated(Context("Signal channel is terminated and empty."))) }
2023-04-21 18:10:15 subsystem exited with error subsystem="network-bridge-tx-subsystem" err=FromOrigin { origin: "network-bridge", source: SubsystemError(Generated(Context("Signal channel is terminated and empty."))) }
2023-04-21 18:10:15 err=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
2023-04-21 18:10:15 Unable to author block in slot. No best block header: Request for leaves from chain selection got canceled
2023-04-21 18:10:15 Failed to send an event to Overseer
2023-04-21 18:10:15 Unable to author block in slot. No best block header: Request for leaves from chain selection got canceled
2023-04-21 18:10:15 subsystem exited with error subsystem="dispute-coordinator-subsystem" err=FromOrigin { origin: "dispute-coordinator", source: SubsystemReceive(Generated(Context("Signal channel is terminated and empty."))) }
2023-04-21 18:10:15 subsystem exited with error subsystem="runtime-api-subsystem" err=Generated(Context("Signal channel is terminated and empty."))
