Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch to ParityDB by default; deprecate RocksDB #1792

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

koute
Copy link
Contributor

@koute koute commented Oct 4, 2023

ParityDB seems to be stable and work pretty well, and it's significantly more efficient than RocksDB, so let's finally switch to it by default and get rid of RocksDB.

This PR disables the ability to create new RocksDB databases and makes ParityDB the default database used.

Existing RocksDB databases are still supported and will still work, but will emit a warning about its deprecation.

Here's how the error looks like when starting a Polkadot node with a --database rocksdb where there's no existing RocksDB database present:

2023-10-04 18:46:02 Parity Polkadot
2023-10-04 18:46:02 ✌️  version 1.1.0-d95ce90cbc3
2023-10-04 18:46:02 ❤️  by Parity Technologies <admin@parity.io>, 2017-2023
2023-10-04 18:46:02 📋 Chain specification: Development
2023-10-04 18:46:02 🏷  Node name: boring-brick-3060
2023-10-04 18:46:02 👤 Role: AUTHORITY
2023-10-04 18:46:02 💾 Database: RocksDb at /tmp/db/chains/rococo_dev/db/full
Error:
   0: Backend error: Creating new RocksDB databases is not supported anymore; please use ParityDB

Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.

And here's the deprecation warning if a RocksDB database already exists:

2023-10-04 21:57:42 Substrate Node
2023-10-04 21:57:42 ✌️  version 3.0.0-dev-0db50256685
2023-10-04 21:57:42 ❤️  by Parity Technologies <admin@parity.io>, 2017-2023
2023-10-04 21:57:42 📋 Chain specification: Development
2023-10-04 21:57:42 🏷  Node name: utter-name-3738
2023-10-04 21:57:42 👤 Role: AUTHORITY
2023-10-04 21:57:42 💾 Database: RocksDb at /tmp/db/chains/dev/db/full
2023-10-04 21:57:45 💀 RocksDB support is deprecated and will be removed on 2024-04-01; please switch to ParityDB
2023-10-04 21:57:45 [0] 💸 generated 1 npos voters, 1 from validators and 0 nominators
2023-10-04 21:57:45 [0] 💸 generated 1 npos targets

@koute koute added I9-optimisation An enhancement to provide better overall performance in terms of time-to-completion for a task. T8-polkadot This PR/Issue is related to/affects the Polkadot network. T9-cumulus This PR/Issue is related to cumulus. T13-deprecation The current issue/pr is, or should be, part of a deprecation process. labels Oct 4, 2023
@koute koute requested review from arkpar and a team October 4, 2023 20:13
@paritytech-cicd-pr
Copy link

The CI pipeline was cancelled due to failure one of the required jobs.
Job name: test-linux-stable 2/3
Logs: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/3884841

)
.map(|_| ())
.map_err(|e| sp_blockchain::Error::Backend(e.to_string()))
crate::utils::open_kvdb_rocksdb::<Block>(db_path, db_type, true, 128)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use parity db in this test?

Comment on lines +40 to +41
let blob = std::io::Cursor::new(include_bytes!("rocksdb_test_database.zip"));
zip::read::ZipArchive::new(blob).unwrap().extract(&tmpdir).unwrap();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just use open_kvdb_rocksdb?

fn main() -> sc_cli::Result<()> {
command::run()
fn main() {
if let Err(error) = command::run() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? To use the Display implementation? Can we not make Debug use Display internally?

@arkpar
Copy link
Member

arkpar commented Oct 5, 2023

Thank you for taking action ;-)

We're currently working on a major feature in paritydb that will change how state data is stored. That might affect stability. I still think we can make the switch now and roll out this new storage method in a fail safe or experimental manner.

This PR disables the ability to create new RocksDB databases

This seems a bit harsh. It will break existing setups that people may have for testing, automation, etc.
I suggest deprecating --db=rocksdb with a similar warning message first, disabling it in a subsequent release.

@bkchr
Copy link
Member

bkchr commented Oct 5, 2023

This seems a bit harsh. It will break existing setups that people may have for testing, automation, etc.

I think it is fine. We want to discourage creating new dbs. You can still use --database rocksdb to open an old db. As long as your stuff continues to work in production, I don't see any problem in this.

@hitchhooker
Copy link
Contributor

took me a month(september) to sync kusama on paritydb with archive mode on 7950x3d + ecc ram + nvme:s in Bangkok, db constantly stopping sync and sync files being. rocksdb syncs in a day or two. that major feature update is said to fix this, but might be something to consider.
sync experience was so painful that i really hope rocksdb aint deprecated too soon

@gituser
Copy link

gituser commented Oct 14, 2023

@koute

ParityDB seems to be stable and work pretty well, and it's significantly more efficient than RocksDB

Can you explain more how it is efficient?
It seems ParityDB takes much more time (slower) compared to RocksDB for initial sync, in terms of storage ParityDB is the same as RocksDB (~300 GB with --pruning=100000), ParityDB uses more host memory compared to RocksDB.

RocksDB imo much more stable compared to ParityDB and it shouldn't be removed or deprecated at this stage unless all these issues are fixed first.

@bkchr
Copy link
Member

bkchr commented Oct 15, 2023

@hitchhooker @gituser please open issues in the parity db repo. We need to fix them. However, rocksdb has its own downsides and we don't want to continue to maintain it.

@koute
Copy link
Contributor Author

koute commented Oct 15, 2023

took me a month(september) to sync kusama on paritydb with archive mode on 7950x3d + ecc ram + nvme:s in Bangkok, db constantly stopping sync and sync files being. rocksdb syncs in a day or two. that major feature update is said to fix this, but might be something to consider.
sync experience was so painful that i really hope rocksdb aint deprecated too soon

Are you sure this was due to ParityDB and not a generic syncing issue related to networking? (I have seen e.g. syncing stopping myself too in the past, but that was also on RocksDB and was unrelated to the database used.)

ParityDB uses more host memory compared to RocksDB

Last time I checked the difference was visibly in favor of ParityDB, unless something changed since that time. From time to time I run performance stress tests where I run a testnet composed of hundreds of nodes on a single machine, and in that test I can run many more nodes when using ParityDB before I hit the memory limit and everything crashes.


Anyway, thank you for your feedback. If you see any problems with ParityDB please make sure to open an issue! If people don't report the problems they see we can't fix them.


@bkchr Perhaps we should go with a little bit softer deprecation here, just in case there are actually still some showstopper problems related to ParityDB?

How about we add a temporary escape hatch, e.g. --allow-deprecated-create-new-rocksdb-database parameter (or something like that) which will allow people to still create new RocksDB databases if absolutely necessary? This will still hopefully push people who have --database rocksdb parameter hardcoded in their node invocations to switch to ParityDB as that will still produce an error, but it case there are actually any showstopper bugs they'll be able to temporarily switch back until we fix the issue.

(So people will have to have --allow-deprecated-create-new-rocksdb-database --database rocksdb to force the node to create a new RocksDB database.)

And if there will be no new issues reported we can just nuke this parameter in the next release or two.

I will also do some testing myself syncing my archive node under ParityDB from scratch a couple of times, just in case.

@kogeler
Copy link

kogeler commented Oct 17, 2023

@bkchr
We are switching from ParityDB to RocksDB for some of our critical nodes right now.

Big DBs can't sync at all if they have a big gap of unsynchronized in the blocks. I met this a lot of times. It is a little bit dangerous in some cases.

ref:

@kogeler
Copy link

kogeler commented Oct 17, 2023

This seems a bit harsh. It will break existing setups that people may have for testing, automation, etc.

I think it is fine. We want to discourage creating new dbs. You can still use --database rocksdb to open an old db. As long as your stuff continues to work in production, I don't see any problem in this.

recently we had had some cases when we needed to sync new DBs from scratch. We couldn't do it using ParityDB. We were forced to use binary DB snapshots.

@kogeler
Copy link

kogeler commented Oct 17, 2023

@koute

I am absolutely sure that the problem is not in the network or in the machines. We use large instances with huge bandwidth with the fastest disks that the large cloud providers can offer. The issue looks like an overloaded disk subsystem. You can check the screenshots of the load charts in my issue.

@arkpar
Copy link
Member

arkpar commented Oct 17, 2023

Archive DBs are indeed slower at the moment on most VPC/VPS machines. This is mostly because there's a lot of small random writes going on. The database hits VPC or hardware IOPS limit. Every inserted trie node results in a random write
Having lots of RAM for page cache helps with performance here.
paritytech/parity-db#199 should resolve this issue by eliminating random writes, but it is a substantial change (mostly on the trie/substrate side). I estimate it to be ready in 4-6 weeks.

@kogeler
Copy link

kogeler commented Oct 18, 2023

Archive DBs are indeed slower at the moment on most VPC/VPS machines. This is mostly because there's a lot of small random writes going on. The database hits VPC or hardware IOPS limit. Every inserted trie node results in a random write Having lots of RAM for page cache helps with performance here. paritytech/parity-db#199 should resolve this issue by eliminating random writes, but it is a substantial change (mostly on the trie/substrate side). I estimate it to be ready in 4-6 weeks.

GCP SSD PD (that I used in the issue) has performance parameters:

  • Read throughput per VM (MB/s) - 1,200
  • Write throughput per VM (MB/s) - 1,200
  • Read IOPS per VM - 100,000
  • Write IOPS per VM - 80,000

Also, I used my own bare-metal instances with local nvme SSD disks. It doesn't matter. Parity DB can not sync current Polkadot and Kusama state archive DBs from scratch.

I feel that the ParityDB database is not stable enough to disable RocksDB.

@kogeler
Copy link

kogeler commented Oct 18, 2023

@kogeler
Copy link

kogeler commented Oct 18, 2023

Also, I created the feature request #219 that is related to the ParityDB snapshots.

It is also relevant because we cannot synchronize the database from scratch and cannot quickly load it from a snapshot in a predictable time at the same time.

It's why we are switching from ParityDB to RocksDB for some of our critical nodes right now.

@hitchhooker
Copy link
Contributor

Archive DBs are indeed slower at the moment on most VPC/VPS machines. This is mostly because there's a lot of small random writes going on. The database hits VPC or hardware IOPS limit. Every inserted trie node results in a random write Having lots of RAM for page cache helps with performance here. paritytech/parity-db#199 should resolve this issue by eliminating random writes, but it is a substantial change (mostly on the trie/substrate side). I estimate it to be ready in 4-6 weeks.

My setup is designed to reach 250k IOPS read/write per TB at hardware level, but I think ZFS as a filesystem might be the limiting factor for random writes(about 2.5x slower performance than lvm/ext4 for random writes in benchmark).

@arkpar
Copy link
Member

arkpar commented Oct 18, 2023

ZFS uses 128k IO pages by default, which is suboptimal for DB workloads, especially ones involving small random writes.
https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Workload%20Tuning.html#database-workloads
For parity-db ZFS recordsize option should be set to a minimum. E.g. 4k

@hitchhooker
Copy link
Contributor

hitchhooker commented Oct 18, 2023

I think main issue for poor performance is just copy on write. Have been running with 16k recordsize for paritydb without really improving benchmarks. Will give 4k a try.

disks=("nvme1n1" "nvme2n1" "nvme3n1" "nvme4n1")
zpool create -o ashift=12 tank $(for disk in "${disks[@]}"; do echo "/dev/${disk}p2"; done)

# Disable access time (atime) as it can negatively impact performance
zfs set atime=off tank
# Set recordsize to 16K as most values in the ParityDb are small and values over 16K are rare
zfs set recordsize=16k tank
# throughput safer than latency
zfs set logbias=throughput tank
# Set the primary cache to only metadata, as ParityDb relies on the OS page cache
zfs set primarycache=metadata tank
# Enable compression as it can provide both space and performance benefits
zfs set compression=lz4 tank
# Set redundant metadata to most to protect against data corruption
zfs set redundant_metadata=most tank
# Synchronous writes (sync) should be set to standard to ensure data integrity in case of an unexpected shutdown
zfs set sync=standard tank

echo "Finished setting up ZFS pool and swap partitions"

@ggwpez
Copy link
Member

ggwpez commented Oct 18, 2023

I started a Kusama archive sync on GCP server ref-hw-ggwpez so that anyone from Parity can SSH into from the web-UI.
It is currently at block 2M and already doing 17k write IOPS. It has 4 local NVMEs with a combined 680K/360K read/write IOPS.

@bkchr
Copy link
Member

bkchr commented Oct 18, 2023

How about we add a temporary escape hatch, e.g. --allow-deprecated-create-new-rocksdb-database parameter (or something like that) which will allow people to still create new RocksDB databases if absolutely necessary? This will still hopefully push people who have --database rocksdb parameter hardcoded in their node invocations to switch to ParityDB as that will still produce an error, but it case there are actually any showstopper bugs they'll be able to temporarily switch back until we fix the issue.

(So people will have to have --allow-deprecated-create-new-rocksdb-database --database rocksdb to force the node to create a new RocksDB database.)

Good idea! But just let's do it by --database rocksdb-deprecated-whatever.

@hitchhooker
Copy link
Contributor

what I have noticed that zfs is really terrible.
image
single nvme with ext4

image
2x these nvme in striped zfs with 4k recordsize

so I doubt that actually the issues with syncing archive paritydb databases are related to use of zfs/btrfs instead just lvm/ext4. I guess copy on write slows down random writes, so probably good idea to recommend to avoid usage of zfs/btrfs for database disks. I think @kogeler had also some alternative filesystem in use.

hitchhooker pushed a commit to ibp-network/polkadot-sdk that referenced this pull request Jun 10, 2024
11279 ± ./scripts/test_bootnodes.sh                                                                                                                   [0m]! ✹ ✭
Initialized output directory and JSON file.
relaychain: kusama
/usr/local/bin/polkadot-parachain --no-hardware-benchmarks --no-mdns -d /tmp/bootnode_data --chain ./chain-spec/people-kusama.json --relay-chain-rpc-urls wss://rpc.ibp.network/kusama --bootnodes /dns/boot-node.helikon.io/tcp/7510/p2p/12D3KooWM1X4setrMWjwnV8iDkAtYhqFHNkGozdWdq6sawWh5Yhv
2024-06-10 12:41:21 [Parachain] ⚙️  Syncing 115.2 bps, target=#175337 (3 peers), best: paritytech#1792 (0x8e56…b18a), finalized #0 (0xc1af…8b3f), ⬇ 943.9kiB/s ⬆ 10.7kiB/s 
helikon people-kusama peer_count: 3
relaychain: kusama
/usr/local/bin/polkadot-parachain --no-hardware-benchmarks --no-mdns -d /tmp/bootnode_data --chain ./chain-spec/people-kusama.json --relay-chain-rpc-urls wss://rpc.ibp.network/kusama --bootnodes /dns/boot-node.helikon.io/tcp/7512/wss/p2p/12D3KooWM1X4setrMWjwnV8iDkAtYhqFHNkGozdWdq6sawWh5Yhv
2024-06-10 12:41:47 [Parachain] ⚙️  Syncing 115.2 bps, target=#175339 (2 peers), best: #1024 (0x2970…7678), finalized #0 (0xc1af…8b3f), ⬇ 541.6kiB/s ⬆ 14.4kiB/s 
helikon people-kusama peer_count: 2
All data has been fetched and saved.
hitchhooker pushed a commit to ibp-network/polkadot-sdk that referenced this pull request Jun 10, 2024
11279 ± ./scripts/test_bootnodes.sh                                                                                                                   [0m]! ✹ ✭
Initialized output directory and JSON file.
relaychain: kusama
/usr/local/bin/polkadot-parachain --no-hardware-benchmarks --no-mdns -d /tmp/bootnode_data --chain ./chain-spec/people-kusama.json --relay-chain-rpc-urls wss://rpc.ibp.network/kusama --bootnodes /dns/boot-node.helikon.io/tcp/7510/p2p/12D3KooWM1X4setrMWjwnV8iDkAtYhqFHNkGozdWdq6sawWh5Yhv
2024-06-10 12:41:21 [Parachain] ⚙️  Syncing 115.2 bps, target=#175337 (3 peers), best: paritytech#1792 (0x8e56…b18a), finalized #0 (0xc1af…8b3f), ⬇ 943.9kiB/s ⬆ 10.7kiB/s
helikon people-kusama peer_count: 3
relaychain: kusama
/usr/local/bin/polkadot-parachain --no-hardware-benchmarks --no-mdns -d /tmp/bootnode_data --chain ./chain-spec/people-kusama.json --relay-chain-rpc-urls wss://rpc.ibp.network/kusama --bootnodes /dns/boot-node.helikon.io/tcp/7512/wss/p2p/12D3KooWM1X4setrMWjwnV8iDkAtYhqFHNkGozdWdq6sawWh5Yhv
2024-06-10 12:41:47 [Parachain] ⚙️  Syncing 115.2 bps, target=#175339 (2 peers), best: #1024 (0x2970…7678), finalized #0 (0xc1af…8b3f), ⬇ 541.6kiB/s ⬆ 14.4kiB/s
helikon people-kusama peer_count: 2
All data has been fetched and saved.
@timorleph
Copy link

Hi, is making ParityDB default something you are still working on? If so, when would you expect this to be done?

We are wondering which DB we should use, and knowing what your plans related to this PR are would be helpful.

@bkchr
Copy link
Member

bkchr commented Jul 16, 2024

We're currently working on a major feature in paritydb that will change how state data is stored. That might affect stability. I still think we can make the switch now and roll out this new storage method in a fail safe or experimental manner.

We are still waiting for this.

@arkpar
Copy link
Member

arkpar commented Jul 17, 2024

The parity-db feature is complete and the integration into substrate was pending testing with this PR: #4176
@MattHalpinParity is taking over the effort to get it merged.

@timorleph
Copy link

So I assume #4176 is the sole blocker here – do you have an (approximate ofc) expected date of delivery? Asking mostly since the last activity there is from a couple months ago.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
I9-optimisation An enhancement to provide better overall performance in terms of time-to-completion for a task. T8-polkadot This PR/Issue is related to/affects the Polkadot network. T9-cumulus This PR/Issue is related to cumulus. T13-deprecation The current issue/pr is, or should be, part of a deprecation process.
Projects
Status: code in review
Development

Successfully merging this pull request may close these issues.

9 participants