
paritydb breaks on upgrade to v0.9.18 #5168

Closed
tugytur opened this issue Mar 21, 2022 · 22 comments
Labels: I3-bug Fails to follow expected behavior.

tugytur (Contributor) commented Mar 21, 2022

Both our paritydb-experimental sync nodes, for Kusama and Polkadot, failed on startup with release v0.9.18.

Running with RUST_BACKTRACE=full on both gave the same output:

WARNING: "paritydb-experimental" database setting is deprecated and will be removed in future releases. Please update your setup to use the new value: "paritydb".
2022-03-21 12:06:51 Parity Polkadot
2022-03-21 12:06:51 ✌️  version 0.9.18-99cd17ddb2-x86_64-linux-gnu
2022-03-21 12:06:51 ❤️  by Parity Technologies <admin@parity.io>, 2017-2022
2022-03-21 12:06:51 📋 Chain specification: Polkadot
2022-03-21 12:06:51 🏷  Node name: dot1paritydb
2022-03-21 12:06:51 👤 Role: FULL
2022-03-21 12:06:51 💾 Database: ParityDb at /var/blockchain/chains/polkadot/paritydb/full
2022-03-21 12:06:51 ⛓  Native runtime: polkadot-9180 (parity-polkadot-0.tx12.au0)
2022-03-21 12:06:54 Essential task `txpool-background` failed. Shutting down service.
2022-03-21 12:06:54 Essential task `transaction-pool-task-0` failed. Shutting down service.
2022-03-21 12:06:54 Essential task `transaction-pool-task-1` failed. Shutting down service.
Error:
   0: Error at calling runtime api: Runtime code error: `:code` hash not found
   1: Runtime code error: `:code` hash not found
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ BACKTRACE ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
                                 ⋮ 1 frame hidden ⋮                               
   2: polkadot::main::he63b6329d0694b27
      at <unknown source file>:<unknown line>
   3: std::sys_common::backtrace::__rust_begin_short_backtrace::h0b52984c10824fba
      at <unknown source file>:<unknown line>
   4: main<unknown>
      at <unknown source file>:<unknown line>
   5: __libc_start_main<unknown>
      at <unknown source file>:<unknown line>
   6: _start<unknown>
      at <unknown source file>:<unknown line>
Run with COLORBT_SHOW_HIDDEN=1 environment variable to disable frame filtering.

On both Kusama and Polkadot, the DBs were synced from scratch with release 0.9.17-rc4. The same goes for the RocksDB servers, but those had no issues updating. The sync servers run with the following relevant arguments: --unsafe-pruning --pruning 1000

The bricked DBs are available here:

SHA256 URL:

S3 URI:

  • s3://euw1.parasnaps.io/polkadot_paritydb-experimental_bricked.tar
  • s3://euw1.parasnaps.io/kusama_paritydb-experimental_bricked.tar

HTTP URL:

The working DBs that break during the upgrade are the ones currently available at https://parasnaps.io . These are the exact DBs that broke during the upgrade ~12 hours ago.

bkchr (Member) commented Mar 21, 2022

Cc @arkpar @cheme

cheme (Contributor) commented Mar 21, 2022

Yes, it is a software bug: the database version in the metadata is being upgraded when it should not be. I am looking at it.

@cheme cheme self-assigned this Mar 21, 2022
@cheme cheme added Z0-unconfirmed I3-bug Fails to follow expected behavior. labels Mar 21, 2022
cheme (Contributor) commented Mar 21, 2022

Re-syncing is safe. Also, manually setting the db_version in the metadata to 5 seems to fix the issue. (The fact that it upgraded to 6 probably comes from a parsing error causing the metadata to be rewritten; manually setting the version back to 5 in the rewritten metadata then works.)

cheme (Contributor) commented Mar 21, 2022

OK, that's it: https://github.com/paritytech/substrate/blob/1bd5d786fda99d5a0c27c339226789ac047ea4ae/client/db/src/utils.rs#L279 should not be allowed.
What happened here is that the column definition of the transaction storage changed, and there was also a new column, so validation failed, resulting in the metadata being rewritten with db_version=6. That breaks state access (the node tries to salt keys as is done with v6).

@tugytur, sorry for the bug.
You don't need to re-sync the snaps; editing the "metadata" file and setting db_version to 5 should fix this.
Re-syncing is not useless though (the db will then be version 6), but it is really not necessary (a migration could be written at some point to update the db).
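
For reference, a minimal sketch of that workaround as shell commands (a sketch, not official tooling): it assumes the node is stopped, the database path from the log at the top of this issue, and that the version is recorded on a version= line of the metadata file; the exact key spelling may vary between parity-db releases, so check the grep output before editing.

DB=/var/blockchain/chains/polkadot/paritydb/full
cp "$DB/metadata" "$DB/metadata.bak"       # keep a backup before touching anything
grep -n 'version' "$DB/metadata"           # should show 6 after the bad upgrade
sed -i 's/^version=6$/version=5/' "$DB/metadata"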

tugytur (Contributor, Author) commented Mar 21, 2022

@cheme thanks!

Can confirm that changing db_version to 5 fixes it.

I like to sync from scratch whenever there are any changes to the db.
We will provide version 6 snapshots as soon as they're synced.

PolkachuIntern commented

I confirm that changing the db version from 6 to 5 in the metadata file fixed the issue.

tugytur (Contributor, Author) commented Mar 21, 2022

@cheme
I've upgraded one of our Kusama validators and applied the db_version fix.
CPU usage was ~30-50% higher than usual. The heartbeat was sent every epoch.

But we were assigned as a para validator twice and earned 0 points.

Inbound and outbound request failures (timeouts) increased from ~0 to a couple of thousand requests per second.
libp2p_out_events_notifications_sizes was constantly growing as well.

The RPC call babe_epochAuthorship was hanging as well; I cancelled it after 5 minutes of waiting.
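
(For reference, such a call can be made against the node's HTTP RPC roughly like this; the port 9933 is taken from the launch parameters further down, and it assumes unsafe RPC methods such as babe_epochAuthorship are reachable locally. On a healthy node this returns within seconds.)

# babe_epochAuthorship takes no parameters
curl -sS -H 'Content-Type: application/json' \
     -d '{"jsonrpc":"2.0","id":1,"method":"babe_epochAuthorship","params":[]}' \
     http://127.0.0.1:9933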

In the meantime I finished syncing db_version 6.
I've migrated the affected validator to the new db_version snapshot and everything is back to normal. I'll make the snapshot publicly available tomorrow if the validator runs stably through the night.

cheme (Contributor) commented Mar 21, 2022

Thanks for the feedback.
My hypothesis is that the high usage is not from using paritydb version 5, but from the switch of the polkadot parachain db between 0.9.17 and 0.9.18 (which, IIUC, requires some additional syncing).
I know we ran this on a validator node before, but it may have been synced from scratch.
CC @ordian, could the described issue be related?

tugytur (Contributor, Author) commented Mar 21, 2022

@cheme, last update for tonight.
There seems to be an issue with db_version 6 as well. I'm not sure how much this affects the validator, but I'm switching back to RocksDB for the time being.

After roughly 20 minutes of running, mpsc_finality_notification_stream and mpsc_import_notification_stream suddenly keep growing (to over 500 in under 20 minutes), and during that time the validator doesn't respond to any RPC calls, though it keeps syncing. When stopping the service, it only goes down via SIGKILL after running into a timeout. I did one restart and waited, to be sure that this isn't a random bug.

Edit: Maybe I jumped to a conclusion here. After the second restart this did not happen again. I'll not switch to RocksDB for now and hope we have more data by the morning.

Edit 2: It happened again. I've switched to RocksDB. I've kept the paritydb in case you would like to have a look at it.

cheme (Contributor) commented Mar 22, 2022

@tugytur, if you have logs available, they could be interesting (even if I have little hope of seeing anything; I am currently running a node without much being logged).
I will push the version fix and then ask to rerun the tests we previously did on a validator. I will also prepare a branch that keeps using RocksDB as the parachain db: that could be a simple way to spot whether the issue is related to the switch of db at this level.

cheme (Contributor) commented Mar 22, 2022

I forgot to ask, @tugytur: what are your launch parameters?

tugytur (Contributor, Author) commented Mar 22, 2022

@cheme Logging is just the defaults enabled, and there are only the default log messages. Nothing indicating an issue.

Launch parameters:

ExecStart=/usr/local/bin/polkadot \
        --name "XXXX" \
        --validator \
        --unsafe-pruning \
        --pruning 1000 \
        --no-mdns \
        --public-addr=/ip4/X.X.X.X/tcp/30333 \
        --listen-addr=/ip4/X.X.X.X/tcp/30333 \
        --chain=kusama \
        --database=paritydb \
        --prometheus-port 9615 \
        --rpc-port 9933 \
        --ws-port 9944 \
        --telemetry-url 'wss://telemetry-backend.w3f.community/submit 1' \
        --base-path=/var/blockchain

paulormart (Contributor) commented

Hi @cheme, I faced the same issue yesterday and collected some logs, although the full trace was not active. There are some errors logged at the time the DB crashed.

You can find the logs here, hope it helps.

cheme (Contributor) commented Mar 22, 2022

Thanks a lot. I also ran into some issues running on our simulation test net.
By the end of the day I will probably propose reverting to RocksDB for the parachain db, so that using paritydb-experimental can be restored and the issue fixed in a correct way.

Edit: the issues observed with simnet are actually not that clear (need to run again).

gituser commented Apr 7, 2022

I have the same issue but on parity v0.9.18 + RocksDB:

RUST_BACKTRACE=full COLORBT_SHOW_HIDDEN=1 ./polkadot --name polkadot --chain=polkadot -d /home/polkadot/.polkadot --rpc-external --ws-external --rpc-cors all --pruning=100000 --disable-log-color --no-private-ipv4
2022-04-07 12:58:23 Parity Polkadot    
2022-04-07 12:58:23 ✌️  version 0.9.18-ef71ed8bae-x86_64-linux-gnu    
2022-04-07 12:58:23 ❤️  by Parity Technologies <admin@parity.io>, 2017-2022    
2022-04-07 12:58:23 📋 Chain specification: Polkadot    
2022-04-07 12:58:23 🏷  Node name: polkadot    
2022-04-07 12:58:23 👤 Role: FULL    
2022-04-07 12:58:23 💾 Database: RocksDb at /home/polkadot/.polkadot/chains/polkadot/db/full    
2022-04-07 12:58:23 ⛓  Native runtime: polkadot-9180 (parity-polkadot-0.tx12.au0)    
2022-04-07 12:58:43 Essential task `txpool-background` failed. Shutting down service.    
Error: 
   0: Error at calling runtime api: Runtime code error: `:code` hash not found
   1: Runtime code error: `:code` hash not found

  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ BACKTRACE ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
   1: color_eyre::config::EyreHook::into_eyre_hook::{{closure}}::h588582e1cf7d8b5b
      at <unknown source file>:<unknown line>
   2: eyre::capture_handler::h9f982628d969531f
      at <unknown source file>:<unknown line>
   3: polkadot::main::hdaa1adbaf9fcd406
      at <unknown source file>:<unknown line>
   4: std::sys_common::backtrace::__rust_begin_short_backtrace::h2902248ec0ab9496
      at <unknown source file>:<unknown line>
   5: std::rt::lang_start::{{closure}}::h57dcbb249ed8206e
      at <unknown source file>:<unknown line>
   6: core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once::h443f738a8e9f947a
      at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/ops/function.rs:259
   7: std::panicking::try::do_call::h1e21ba261ba489ec
      at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/panicking.rs:406
   8: std::panicking::try::h6afd48af8b6c96ac
      at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/panicking.rs:370
   9: std::panic::catch_unwind::h85dd95e0bab7fb60
      at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/panic.rs:133
  10: std::rt::lang_start_internal::{{closure}}::h038455e697c8b03e
      at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/rt.rs:128
  11: std::panicking::try::do_call::h6b0ad65979f3077a
      at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/panicking.rs:406
  12: std::panicking::try::h010108d314169ac6
      at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/panicking.rs:370
  13: std::panic::catch_unwind::hff397f912b1535c2
      at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/panic.rs:133
  14: std::rt::lang_start_internal::h52e73755f77c7dd9
      at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/rt.rs:128
  15: main<unknown>
      at <unknown source file>:<unknown line>
  16: __libc_start_main<unknown>
      at <unknown source file>:<unknown line>
  17: _start<unknown>
      at <unknown source file>:<unknown line>

Run with COLORBT_SHOW_HIDDEN=1 environment variable to disable frame filtering.

There is no metadata file at all, but there is a db_version file which contains 4 (without a newline). I've tried setting it to 5, but then it tells me that the db version is in the future.
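
(In commands, the above looks roughly like this; the location of the db_version file is an assumption based on the database path in the log, so adjust it to wherever the file lives on your setup.)

DB=/home/polkadot/.polkadot/chains/polkadot/db/full
cat "$DB/db_version"; echo        # prints 4 -- the file has no trailing newline
printf '5' > "$DB/db_version"     # after this the node reports the db version as being in the future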

Any ideas how to fix this without a resync? @cheme

cheme (Contributor) commented Apr 7, 2022

About this issue:

  • There was a versioning bug in paritydb; this is fixed in master (@paulormart and @tugytur experienced it). The bug does not affect syncing from scratch, and there is a workaround to avoid re-syncing.
  • @tugytur experienced some other issue with paritydb. I tried to reproduce it and saw some things, but they were false positives (I did a warp sync and the peaks were background syncing). At this point we are in the process of switching our validators to paritydb to assess this issue, but so far I am not aware of negative feedback from devops.

@gituser, in your case this should be a different issue (you are using RocksDB): the error indicates that some state value (the runtime code) cannot be accessed. My first idea would be that the RocksDB got corrupted somehow; I don't have ideas for how to fix this without a resync. Maybe open a new issue; other devs may be able to relate to it.

gituser commented Apr 7, 2022

@cheme
I've pasted it here because I had a similar error.
The server froze (I think it's something to do with the Ryzen CPU getting stuck), and then after a reset the Polkadot DB was corrupted.

cheme (Contributor) commented Apr 7, 2022

> @cheme I've pasted it here because I had a similar error. The server froze (I think it's something to do with the Ryzen CPU getting stuck), and then after a reset the Polkadot DB was corrupted.

Sure, that totally makes sense; I am not saying this should be ignored. This is quite valuable information.
But it is not related to the initial issue (paritydb), so the db_version workaround cannot work.
Your information about experiencing server freezes with RocksDB on 0.9.18 is also good feedback.

alko89 commented Apr 7, 2022

@gituser this seems to have happened to me as well, about a week ago with RocksDB. I opened an issue here, #5210, but it got closed since I couldn't provide any relevant data.

Edit: I'm on an Intel CPU.

gituser commented Apr 9, 2022

@alko89 I ended up re-syncing from scratch. I'm on AMD though.

cheme (Contributor) commented May 18, 2022

paritytech/parity-db#68 fixes a deadlock that may explain this issue.
Today's 0.9.22 release includes this fix.
I will close the issue once we observe our Rococo validators working properly (not yet upgraded).

nuke-web3 (Contributor) commented

@cheme Assuming things are all good now? Please let me know if not!
