Problem: db size increase too fast #451

Closed
yihuang opened this issue May 3, 2022 · 64 comments

yihuang commented May 3, 2022

Investigate whether there are low-hanging fruits to reduce the db size.

yihuang commented May 3, 2022

For reference:

939G	application.db
42G	blockstore.db
1.0G	cs.wal
46M	evidence.db
4.0K	priv_validator_state.json
47M	snapshots
81G	state.db
238G	tx_index.db

yihuang commented May 3, 2022

Remove tx_index.db

Currently we rely on the tx indexer to query txs by eth tx hash. An alternative solution is to store that index in a standalone kv db on the app side, so we don't need to retain all the tx indexes.
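
For illustration, a minimal sketch of such an app-side index, assuming the tm-db interface; the type names and key layout here are hypothetical, not the actual implementation:

// Hypothetical sketch: a standalone kv index mapping eth tx hash -> (block height, tx index),
// maintained by the app so the Tendermint tx indexer can be disabled.
package txidx

import (
	"encoding/binary"

	dbm "github.com/tendermint/tm-db"
)

type EthTxIndexer struct {
	db dbm.DB
}

func NewEthTxIndexer(db dbm.DB) *EthTxIndexer {
	return &EthTxIndexer{db: db}
}

// Index records where an eth tx can be found; called by the app for every ethereum tx in a block.
func (i *EthTxIndexer) Index(ethTxHash []byte, height int64, txIndex uint32) error {
	value := make([]byte, 12)
	binary.BigEndian.PutUint64(value[:8], uint64(height))
	binary.BigEndian.PutUint32(value[8:], txIndex)
	return i.db.Set(ethTxHash, value)
}

// Lookup serves eth_getTransactionByHash-style queries from the json-rpc layer.
func (i *EthTxIndexer) Lookup(ethTxHash []byte) (height int64, txIndex uint32, err error) {
	value, err := i.db.Get(ethTxHash)
	if err != nil || value == nil {
		return 0, 0, err
	}
	return int64(binary.BigEndian.Uint64(value[:8])), binary.BigEndian.Uint32(value[8:]), nil
}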

JayT106 commented May 3, 2022

RocksDB uses Snappy as the default compression algorithm. We could use LZ4 or other more aggressive (but potentially more resource-consuming) algorithms as its compression option.
ref:
https://github.com/facebook/rocksdb/wiki/Compression
https://github.com/tendermint/tm-db/blob/d24d5c7ee87a2e5da2678407dea3eee554277c83/rocksdb.go#L33
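
A minimal sketch (assuming the tecbot/gorocksdb bindings that tm-db wraps) of opening a db with LZ4 instead of the default Snappy:

package dbopt

import (
	gorocksdb "github.com/tecbot/gorocksdb"
)

// openWithLZ4 opens (or creates) a RocksDB database with LZ4 block compression
// instead of the default Snappy.
func openWithLZ4(dir string) (*gorocksdb.DB, error) {
	opts := gorocksdb.NewDefaultOptions()
	opts.SetCreateIfMissing(true)
	opts.SetCompression(gorocksdb.LZ4Compression)
	return gorocksdb.OpenDb(opts, dir)
}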

JayT106 commented May 3, 2022

Remove tx_index.db

Currently we rely on the tx indexer to query txs by eth tx hash. An alternative solution is to store that index in a standalone kv db on the app side, so we don't need to retain all the tx indexes.

Yes, we should consider using a new kvstore just for storing the tx hash mapping. We could also disable the Tendermint indexer to improve consensus performance.

adu-crypto commented:

Remove tx_index.db

Currently we rely on the tx indexer to query txs by eth tx hash. An alternative solution is to store that index in a standalone kv db on the app side, so we don't need to retain all the tx indexes.

you mean nodes could choose not to have this tx_index.db by moving this part off-chain?

yihuang commented May 4, 2022

Remove tx_index.db

Currently we rely on the tx indexer to query txs by eth tx hash. An alternative solution is to store that index in a standalone kv db on the app side, so we don't need to retain all the tx indexes.

you mean nodes could choose not to have this tx_index.db by moving this part off-chain?

Yes, by storing the eth tx hash index in another place.

JayT106 commented May 4, 2022

I will start a testing build with a custom RocksDB setup to see how much it can be improved.

yihuang commented May 5, 2022

# IndexEvents defines the set of events in the form {eventType}.{attributeKey},
# which informs Tendermint what to index. If empty, all events will be indexed.
#
# Example:
# ["message.sender", "message.recipient"]
index-events = []

There's an option in app.toml to fine-tune which events to index.

The minimal one for json-rpc to work should be:

index-events = ["ethereum_tx.ethereumTxHash", "ethereum_tx.txIndex"]

EDIT: ethereum_tx.txIndex is necessary too.

JayT106 commented May 6, 2022

For reference:

939G	application.db
42G	blockstore.db
1.0G	cs.wal
46M	evidence.db
4.0K	priv_validator_state.json
47M	snapshots
81G	state.db
238G	tx_index.db

At which block height was this DB size observed?

JayT106 commented May 9, 2022

Looks like LZ4 might be working: the current application.db of the testing node at block height 1730K is around 511G. Projecting to today's block height (2692K), it will be around 755G. At the same time, the application.db of the full node with RocksDB using Snappy is 1057G, so roughly a 25% space saving.

Let's wait until the testing node fully syncs up to the network and see the final result.

yihuang commented May 17, 2022

@tomtau mentioned we could gather some statistics on application.db to see what kinds of data occupy the most space, then see if there's any waste that can be saved in the corresponding modules. For example, iterate the iavl tree and sum the value lengths for each module prefix.
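
A rough sketch of that statistics idea, assuming a goleveldb-backed application.db and the SDK multistore's "s/k:<store>/" key prefix layout (adjust to the actual backend and key format):

package main

import (
	"fmt"
	"strings"

	dbm "github.com/tendermint/tm-db"
)

func main() {
	// Open application.db and walk every key/value pair.
	db, err := dbm.NewGoLevelDB("application", "./data")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	keyTotal := map[string]int64{}
	valTotal := map[string]int64{}

	it, err := db.Iterator(nil, nil)
	if err != nil {
		panic(err)
	}
	defer it.Close()

	for ; it.Valid(); it.Next() {
		k := string(it.Key())
		module := "other"
		// Multistore keys look like "s/k:<store name>/<store key>".
		if strings.HasPrefix(k, "s/k:") {
			if end := strings.Index(k[4:], "/"); end >= 0 {
				module = k[4 : 4+end]
			}
		}
		keyTotal[module] += int64(len(it.Key()))
		valTotal[module] += int64(len(it.Value()))
	}

	for module, kt := range keyTotal {
		fmt.Printf("%s: keys=%d bytes, values=%d bytes\n", module, kt, valTotal[module])
	}
}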

yihuang commented May 17, 2022

BTW, this is the pruning=default node size (thanks @allthatjazzleo):

535G	/chain/.cronosd/data/application.db
20K	/chain/.cronosd/data/snapshots
44G	/chain/.cronosd/data/blockstore.db
120G	/chain/.cronosd/data/state.db
312G	/chain/.cronosd/data/tx_index.db
20K	/chain/.cronosd/data/evidence.db
1023M	/chain/.cronosd/data/cs.wal
1011G	/chain/.cronosd/data/

Compared to full archive one:

1.1T	/chain/.cronosd/data/application.db
79M	/chain/.cronosd/data/snapshots
47G	/chain/.cronosd/data/blockstore.db
90G	/chain/.cronosd/data/state.db
260G	/chain/.cronosd/data/tx_index.db
78M	/chain/.cronosd/data/evidence.db
1.1G	/chain/.cronosd/data/cs.wal
1.5T	/chain/.cronosd/data/

JayT106 commented May 17, 2022

BTW, this is the pruning=default node size (thanks @allthatjazzleo):

535G	/chain/.cronosd/data/application.db
20K	/chain/.cronosd/data/snapshots
44G	/chain/.cronosd/data/blockstore.db
120G	/chain/.cronosd/data/state.db
312G	/chain/.cronosd/data/tx_index.db
20K	/chain/.cronosd/data/evidence.db
1023M	/chain/.cronosd/data/cs.wal
1011G	/chain/.cronosd/data/

Compared to full archive one:

1.1T	/chain/.cronosd/data/application.db
79M	/chain/.cronosd/data/snapshots
47G	/chain/.cronosd/data/blockstore.db
90G	/chain/.cronosd/data/state.db
260G	/chain/.cronosd/data/tx_index.db
78M	/chain/.cronosd/data/evidence.db
1.1G	/chain/.cronosd/data/cs.wal
1.5T	/chain/.cronosd/data/

The pruning=default only keeps the last 100 states, so it will be good for running a node without query functions.

JayT106 commented May 17, 2022

Got the testing node synced up to the planned upgrade height.
using default:

1057776M	./application.db
45714M	./blockstore.db
88630M	./state.db

using LZ4:

1058545M	./application.db
47363M	./blockstore.db
88633M	./state.db

It matches the benchmark in this article: there is no gain in compression ratio, only gains in compression/decompression speed.
https://morotti.github.io/lzbench-web/?dataset=canterbury/alice29.txt&machine=desktop

tomtau commented May 18, 2022

why is state.db larger in the pruned one? (120GB vs 90GB)

JayT106 commented May 26, 2022

Went through the application.db and got some basic statistics (at height 2933002; sizes are raw data lengths):
The evm and ibc modules use most of the store space in the database, which is not surprising; will look at more details in these modules.

evm
~24.6M kv pairs, keySizeTotal: ~1.3G, valueSizeTotal: ~976M, avg key size:52, avg value size:39
ibc
~2.6M kv pairs, keySizeTotal: ~149M, valueSizeTotal: ~58M, avg key size:57, avg value size:22

yihuang commented May 26, 2022

Another related thing: in v0.6.x we had a minor issue where contract suicide doesn't really delete the code and storage; not sure how much impact that has on the db size though.

yihuang commented May 26, 2022

It feels like ibc shouldn't store so many pairs; can you see the prefixes?

JayT106 commented May 26, 2022

The major key patterns in the ibc store:

acks/ports/transfer/channels/channel-0/sequences/... counts 1003777
receipts/ports/transfer/channels/channel-0/sequences/... counts 1003777
clients/07-tendermint-1/consensusStates/... counts 403893
636C69656E74732F30372D74656E6465726D696E742D31... (hex code of clients/07-tendermint-1) counts 134631

tomtau commented May 27, 2022

I guess some historical (i.e. older than "evidence age") states, acks, receipts... could be pruned from ibc application storage?
Do you have a more detailed breakdown of evm?

yihuang commented May 27, 2022

https://github.com/cosmos/ibc-go/blob/release/v2.2.x/modules/light-clients/07-tendermint/types/update.go#L137
For the consensusStates there is pruning logic, but it only deletes at most one item at a time; we might need to check how many expired ones there currently are.

The sequence keys don't seem to be pruned at all.

JayT106 commented May 27, 2022

Do you have a more detailed breakdown of evm?

Working on it.
The evmstore stores:
1: code, the key will be the prefix 01 + codehash (this part should be fine)
2: storage, the key will be the prefix 02 + eth account address + hash of something (trying to figure out)
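
A sketch of the key layout being described; the helper names are illustrative, only the 0x01/0x02 prefixes come from the listing above:

package evmkeys

import "github.com/ethereum/go-ethereum/common"

const (
	prefixCode    = 0x01 // code bucket
	prefixStorage = 0x02 // contract storage bucket
)

// CodeKey: 0x01 | 32-byte code hash -> contract bytecode
func CodeKey(codeHash common.Hash) []byte {
	return append([]byte{prefixCode}, codeHash.Bytes()...)
}

// StorageKey: 0x02 | 20-byte account address | 32-byte slot hash -> storage value
func StorageKey(addr common.Address, slot common.Hash) []byte {
	key := append([]byte{prefixStorage}, addr.Bytes()...)
	return append(key, slot.Bytes()...)
}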

yihuang commented May 27, 2022

The EVM module's storage schema is much simpler: contract code and storage. The storage slots are calculated by the evm internally, so I guess there's not much to prune there.

yihuang commented May 27, 2022

2: storage, the key will be the prefix 02 + eth account address + hash of something (trying to figure out)

It's the storage slot number, computed by the evm internally.
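
For illustration (this follows Solidity's standard storage layout, not ethermint-specific code): a mapping entry lives at keccak256(key ‖ slot index), which is why a single busy contract can own a huge number of distinct slots:

package main

import (
	"fmt"

	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/crypto"
)

// mappingSlot returns the storage slot of mapping[key] for a mapping declared at
// slotIndex, following Solidity's storage layout rules.
func mappingSlot(key, slotIndex common.Hash) common.Hash {
	return crypto.Keccak256Hash(key.Bytes(), slotIndex.Bytes())
}

func main() {
	// e.g. a balances[holder] entry for a mapping declared at slot 0
	holder := common.HexToHash("0x0000000000000000000000001359135b1c9eb7393f75271e9a2b72fc0d055b2e")
	fmt.Println(mappingSlot(holder, common.HexToHash("0x0")))
}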

JayT106 commented May 27, 2022

2: storage, the key will be the prefix 02 + eth account address + hash of something (trying to figure out)

It's the storage slot number, computed by the evm internally.

In the storage part, the address 1359135B1C9EB7393F75271E9A2B72FC0D055B2E has 382381 kv pairs; does it really store that many slots?
https://cronos.org/explorer/address/0x1359135B1C9Eb7393f75271E9a2b72fc0d055B2E/transactions

JayT106 commented Jun 6, 2022

There is orphan data (the historical node data) in the DB; I will calculate the size.

Iterating the db with all orphans: 3,394,612,961 entries, totalKeySize: 166G, totalValueSize: 108.6G
I think there are some intermediate nodes in the other versions that haven't been counted.

Can you check how large the DB is if we prune all the history versions?
Sure, will check it.

EVM traces are important in the historical storage, so that's something to look at: how much is consumed by them, and, if they take up most of the storage, whether they can be stored in a more resourceful way.

From https://geth.ethereum.org/docs/dapp/tracing
Geth will regenerate the desired state by replaying blocks from the closest point in time before B where it has full state. This defaults to 128 blocks max, but you can specify more in the actual call ... "reexec":1000 .. } to the tracer.

Maybe we can define a proper state pruning interval to make sure the tx re-execution is not too heavy for the node?
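
For reference, a sketch (using go-ethereum's rpc client against a json-rpc endpoint) of passing the "reexec" option so the tracer is allowed to re-execute more blocks to rebuild the state it needs:

package tracing

import (
	"context"

	"github.com/ethereum/go-ethereum/rpc"
)

// traceWithReexec traces a tx, allowing the node to re-execute up to 1000 blocks
// to regenerate the historical state required by the tracer.
func traceWithReexec(url, txHash string) (interface{}, error) {
	client, err := rpc.Dial(url)
	if err != nil {
		return nil, err
	}
	defer client.Close()

	var result interface{}
	err = client.CallContext(context.Background(), &result, "debug_traceTransaction",
		txHash, map[string]interface{}{"reexec": 1000})
	return result, err
}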

tomtau commented Jun 7, 2022

Iterating the db with all orphans: 3,394,612,961 entries, totalKeySize: 166G, totalValueSize: 108.6G
I think there are some intermediate nodes in the other versions that haven't been counted.

Not sure what the keys are like, but if they are the slot keys, bucketing or sha3(contract address, slot) -> value could perhaps help.

Maybe we can define the proper state pruning interval to make sure the tx re-executing is not too heavy to the node?

@JayT106 you can try syncing with a different config and consult with @CeruleanAtMonaco @jun0tpyrc on whether that config will still be all right for dApps or exchanges that need a full or archival node-like setting.
I'd also suggest creating a test environment for experimenting with the impact of code changes on storage, given that it seems hard to tell:

  1. export mainnet transactions up to a certain block height (I assume the upgrade height could give a perspective on the current storage usage)
  2. create some experimental binaries forked off cronosd that will:
  • be able to replay the exported transactions (without consensus) and store the resulting state
  • have a patch of Ethermint that could reduce the storage size growth: slot storage changed, using v0.46.0-rc1 Cosmos SDK and its "V2" storage...

JayT106 commented Jun 7, 2022

The pruning=default keeps the last 100 states plus one state for every 500th block, which is 3076735 / 500 + 100 = 6253 for mainnet; the db size is still more than 500G. Can you check how large the DB is if we prune all the history versions?

I think the comment in app.toml is misleading. Checked in SDK v0.44.6 & v0.45.4.
The default pruning is:

PruneDefault = NewPruningOptions(362880, 100, 10)

meaning the app will keep the latest 362880 versions (around 21 days at a 5-second block time), only keep 1 version for every 100 blocks past the keepRecent period (the rest will be put into the pruning list), and execute the pruning every 10 blocks.

Also, the settings only affect the current DB, meaning it will not prune past versions far beyond the keepRecent period if the node was previously set to prune nothing.
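
For reference, the presets discussed here as defined in cosmos-sdk store/types (v0.44/v0.45), plus an example custom option; the custom values are only illustrative:

package pruningref

import storetypes "github.com/cosmos/cosmos-sdk/store/types"

var (
	// keep the latest 362880 versions, keep every 100th older version, prune every 10 blocks
	pruneDefault = storetypes.NewPruningOptions(362880, 100, 10)
	// keep only the 2 latest versions and prune everything else
	pruneEverything = storetypes.NewPruningOptions(2, 0, 10)
	// example custom setting: 2 recent versions plus one kept state every 600 blocks
	pruneCustom = storetypes.NewPruningOptions(2, 600, 10)
)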

JayT106 commented Jun 7, 2022

Not sure what the keys are like, but if they are the slot keys, bucketing or sha3(contract address, slot) -> value could perhaps help.

I think it's the slot keys: the past states (changed or removed) related to the contract addresses.

yihuang commented Jun 7, 2022

The pruning=default keeps the last 100 states plus one state for every 500th block, which is 3076735 / 500 + 100 = 6253 for mainnet; the db size is still more than 500G. Can you check how large the DB is if we prune all the history versions?

I think the comment in app.toml is misleading. Checked in SDK v0.44.6 & v0.45.4.

The default pruning is:


PruneDefault = NewPruningOptions(362880, 100, 10)

meaning the app will keep the latest 362880 versions (around 21 days at a 5-second block time), only keep 1 version for every 100 blocks past the keepRecent period (the rest will be put into the pruning list), and execute the pruning every 10 blocks.

Also, the settings only affect the current DB, meaning it will not prune past versions far beyond the keepRecent period if the node was previously set to prune nothing.

Does that mean a more aggressive pruning setting can drop the db size a lot?
At least validator nodes can use a much more aggressive pruning setting.

JayT106 commented Jun 7, 2022

Does that mean a more aggressive pruning setting can drop the db size a lot?
At least validator nodes can use a much more aggressive pruning setting.

Validator nodes should be able to use prune-everything to minimize disk usage because they only need to perform new block execution. The setting is:
PruneEverything = NewPruningOptions(2, 0, 10)

But a problem is that they cannot be a reference node for the snapshot.
Maybe we can do NewPruningOptions(2, 600, 10) to keep a state roughly every hour?

I will ask Devops to set up a testing node to see the result of different pruning setups.

calvinaco commented:

But a problem is that they cannot be a reference node for the snapshot.

What is the significance for a validator to be a ref. node of the snapshot? If we are pushing validator to its max storage efficiency I think they don't need to be that?

JayT106 commented Jun 7, 2022

What is the significance for a validator to be a ref. node of the snapshot? If we are pushing validator to its max storage efficiency I think they don't need to be that?

If our validator doesn't need to be a ref. node for the snapshot, then yes, we can prune everything. But outside validators will need to know the difference when setting up the different pruning options (perhaps they use theirs as a ref. node).
The point is that the node operator needs to know the purpose of setting prune-everything and its consequences.

JayT106 commented Jun 7, 2022

@tomtau
Checking the ADR-040 status, I didn't see any mention that the V2 storage solves the storage size issue. On the contrary, it might add key overhead. Moreover, I feel like the current implementation is not feature-complete. @adu-crypto could you share your insight on it?

However, I agree with you we can test it with the new SDK to see the impact.

tomtau commented Jun 8, 2022

@tomtau Checking the ADR-040 status, I didn't see any mention that the V2 storage solves the storage size issue. On the contrary, it might add key overhead. Moreover, I feel like the current implementation is not feature-complete. @adu-crypto could you share your insight on it?

However, I agree with you we can test it with the new SDK to see the impact.

Possibly, there are two things to verify:

  1. it should have less overhead in terms of the intermediate nodes and metadata
  2. try a more "custom" usage of ADR-040 to mimic go-ethereum's storage: Problem: db size increase too fast #451 (comment) (SC to store root hashes, SS to store slot mappings)

use low level db snapshots for versioning
don’t rely on the versioning ability of the Sparse Merkle tree?
and we can just store the contract storage in the Sparse Merkle tree
if we rely on low level db snapshots, we may be able to extend it to do whatever we want on the database and still enjoy the atomicity and versioning
like adopting an ethereum-like implementation, just put on a different column family

adu-crypto commented Jun 8, 2022

we've refactored a chain (evmos) to run on the testnet; now have to refactor it again, since 0.46 is reverting back to ante-handlers

Currently they are refactoring evmos to test on it.
Speaking of feature-completeness, I think the current store v2 module in the main branch is not compatible with the old store v1, which means replacing store v1 with store v2 in baseapp will take more effort.
Besides, as far as I know, they are also working on compatibility with the IBC bridge.
Speaking of the db size issue, store v2 has two extra data buckets for each substore to avoid smt traversal; but as for whether it should have less overhead in terms of the intermediate nodes and metadata, as @tomtau mentioned, I think we need to benchmark to see the real impact.

yihuang commented Jun 8, 2022

But a problem is that they cannot be a reference node for the snapshot.

Can we keep the snapshot interval identical to the pruning interval, so we can keep snapshots working while maximizing pruning?

yihuang commented Jun 8, 2022

From https://geth.ethereum.org/docs/dapp/tracing
Geth will regenerate the desired state by replaying blocks from the closest point in time before B where it has full state. This defaults to 128 blocks max, but you can specify more in the actual call ... "reexec":1000 .. } to the tracer.

Sounds like a good idea, but it needs to be done at the cosmos-sdk level: cosmos/cosmos-sdk#12183

yihuang commented Jun 10, 2022

FYI, @garyor has synced a pruned node which only keeps the recent 50 versions; the db size is:

1.0G	data/cs.wal
99G	data/state.db
54G	data/blockstore.db
16K	data/lost+found
295G	data/tx_index.db
12G	data/application.db
74M	data/snapshots
9.5M	data/evidence.db
458G	data/

Noticeably, application.db is only 12G.
He'll try the index-events setting to see how that helps with the tx_index.db; we'll know the result a week later. 🤦‍♂️

JayT106 commented Jun 10, 2022

If we want to be more aggressive, we can set min-retain-blocks to 50 or something; it will prune state.db and blockstore.db.
But it is probably only suitable for a validator's setup.

I am working on migrating the storage from V1 to V2 and calculating the size.

JayT106 commented Jun 14, 2022

Migrating from store V1 to V2 might have a big problem. We have 24M+ kv pairs in the latest version (2933002) of my testing evm module. It took 4 days and only migrated ~1/3 of the kv pairs (still ongoing).
I wonder if there is a big performance degradation when the SMT inserts a huge number of kv pairs, because the migration iterates the kv pairs of the v1 store, sets them into the substore of the v2 store, and commits the changes at the end.
https://github.com/cosmos/cosmos-sdk/blob/ecdc68a26bd4bd5482bb0ffe72999bdb74dcbd94/store/v2alpha1/multi/migration.go#L57

However, I tested a small dataset to compare the DB size between store V1 and store V2; it looks like it helps when we ignore RocksDB's checkpoints:
cosmos/cosmos-sdk#12251

I will try to scale up the simulation to see how the numbers differ. But we might need to dig into the SMT implementation to see why the migration takes so long. @adu-crypto any idea?

yihuang commented Jun 14, 2022

But using the v2 store without creating checkpoints for every version doesn't support querying historical state, right?

tomtau commented Jun 15, 2022

@JayT106 how about with go-leveldb? with that SC/SS separation, maybe go-leveldb still may be viable for RPC nodes?

JayT106 commented Jun 15, 2022

But using the v2 store without creating checkpoints for every version doesn't support querying historical state, right?

That sounds right, but I don't understand why the checkpoints take so much disk space; could it be an implementation issue, or is my test setup incorrect?

how about with go-leveldb?

SDK v0.46.x didn't implement go-leveldb support for the v2 store.

adu-crypto commented:

Migrating from store V1 to V2 might have a big problem. We have 24M+ kv pairs in the latest version (2933002) of my testing evm module. It took 4 days and only migrated ~1/3 of the kv pairs (still ongoing). I wonder if there is a big performance degradation when the SMT inserts a huge number of kv pairs, because the migration iterates the kv pairs of the v1 store, sets them into the substore of the v2 store, and commits the changes at the end. https://github.com/cosmos/cosmos-sdk/blob/ecdc68a26bd4bd5482bb0ffe72999bdb74dcbd94/store/v2alpha1/multi/migration.go#L57

However, I tested a small dataset to compare the DB size between store V1 and store V2; it looks like it helps when we ignore RocksDB's checkpoints: cosmos/cosmos-sdk#12251

I will try to scale up the simulation to see how the numbers differ. But we might need to dig into the SMT implementation to see why the migration takes so long. @adu-crypto any idea?

@JayT106 currently, smt has a slowdown of more than an order of magnitude in terms of write performance.
This is fully benchmarked and discussed here:
cosmos/cosmos-sdk#11328 (comment)
cosmos/cosmos-sdk#11444
They have done some optimizations like removing the smt-level key -> value cache and not recomputing the root hash until commit, and so on.
They even have a new smt implementation.
I think this is possibly the biggest blocker for the smt migration now.

tomtau commented Jun 15, 2022

some ideas from turbogeth/erigon: https://github.com/ledgerwatch/erigon#more-efficient-state-storage
https://github.com/ledgerwatch/erigon/blob/devel/docs/programmers_guide/db_walkthrough.MD

JayT106 commented Jun 21, 2022

So I think the main issue is the IAVL+ tree design. Take the EVM module storage at the current cronos scale: the tree height is around 25 at the recent version. Each leaf modification causes 24 intermediate nodes to be updated; the previous versions of those intermediate nodes are kept in the database, and 24 new intermediate nodes are written for the new version of the EVM store.

So we can say the write amplification for each tree operation is proportional to the height of the tree. It is very different from the Ethereum trie implementation.
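
A back-of-the-envelope check with the numbers measured earlier (~24.6M evm kv pairs):

tree height h ≈ log2(24.6e6) ≈ 24.6, so ~25 levels
intermediate nodes rewritten per leaf update ≈ h − 1 ≈ 24
a block touching k storage slots therefore persists on the order of 24·k new nodes, while the replaced nodes remain on disk as orphans until pruned.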

Since the SMT is still not mature, should we proceed with some workaround to reduce the tree height,
i.e. @yihuang's idea evmos/ethermint#1099, or
follow the go-ethereum storage design #451 (comment) and create an IAVL tree (or Ethereum Merkle Trie) for each EVM account?

Or should we look at the Cosmos SDK's new SMT implementation to see whether there is anything we can contribute, so we might be able to use it earlier?

tomtau commented Jun 27, 2022

Or should we look at the Cosmos SDK's new SMT implementation to see whether there is anything we can contribute, so we might be able to use it earlier?

Yes, I think fixing the issues in the SMT implementation is the way to proceed (plus also exploring how the SMT implementation could be best leveraged in the Ethermint context, see the "custom" usage note: #451 (comment) to mimic go-ethereum).

Two related notes / issues:

yihuang commented Jun 27, 2022

Several updates:

  • Can't run v0.6 fix-unlucky-tx with pruning, because it needs to replay old tx.

    • We need to sync without pruning, run fix-unlucky-tx, then prune manually.
  • Using a minimal index-events config reduces the tx_index.db size by half (138G at 2693800), but it still needs to store the full tx results, so it can't be reduced further by removing indexes. So I think the ethermint custom tx indexer should be much better; I will try that soon.

yihuang commented Jun 30, 2022

evmos/ethermint#1121 (comment)

The tx indexer db size reduction is very good with the custom eth tx indexer.

yihuang commented Feb 13, 2023

I think we can close this one now in favor of #869, which tracks the development progress. @tomtau

yihuang closed this as completed Jul 12, 2024