Implement inlining small values for FlatState #8243
…8729) We plan to introduce small value inlining as a next step in Flat Storage; see #8243 for more details. This is not part of the MVP, which means it will not be included in the first Flat Storage release. This PR wraps `ValueRef` in a `FlatStateValue` enum, so later we can simply add another variant for inlined values without changing the format on disk. This significantly simplifies migration in the future.
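For reference, a minimal sketch of what such an enum could look like. The `Ref`/`Inlined` variant names follow the PRs in this thread, but the field layout of `ValueRef` and the derives are assumptions, not the exact nearcore definition:

```rust
use borsh::{BorshDeserialize, BorshSerialize};

/// Reference to a value stored in the trie: its length plus its hash.
/// The field layout here is illustrative; see the actual nearcore definition.
#[derive(BorshSerialize, BorshDeserialize, Clone)]
pub struct ValueRef {
    pub length: u32,
    pub hash: [u8; 32],
}

/// Value stored on disk in the FlatState column. Starting with only the
/// `Ref` variant keeps the format forward compatible: the `Inlined` variant
/// (added in the follow-up PRs below) extends the enum without rewriting
/// existing rows, because borsh encodes the variant as a leading tag byte.
#[derive(BorshSerialize, BorshDeserialize, Clone)]
pub enum FlatStateValue {
    Ref(ValueRef),
    Inlined(Vec<u8>),
}
```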
**On-disk inlining size threshold**

**General considerations**

We want to inline as many flat storage values as possible to make both runtime and state sync faster. Inlining is triggered when the size of the value is lower than or equal to the threshold. Setting the threshold too high results in a security issue. Another consideration is the space increase caused by inlining large values.

**mainnet data**

Some insights into our data across all shards:
Percentiles of value sizes:
Inlined values ratio and space overhead for some thresholds:
**Conclusion**

Setting the threshold to 4KiB results in inlining 99.8% of values with ~27% memory overhead on values, which is acceptable.

Update: @jakmeier pointed out that we have some value encoding overhead, so let's set the limit to 4000 bytes instead of 4KiB to be on the safe side.
Great summary and promising results! One small nit/question: I see you tested with 4096B. I assume that's the size of the value, not its encoding? If so, would it make sense to make the value limit slightly less than 4KiB to account for the borsh encoding overhead? For example, 4KB (4000B) rather than 4096B? (Borsh uses a 4-byte length prefix for serialized byte vectors.)
Thanks @jakmeier, that is a very good point! I agree, let's use 4000 bytes instead; I've updated my comment as well.
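To make the encoding overhead concrete: borsh prefixes a byte vector with a 4-byte length and the enum adds a tag byte, so a value of exactly 4096 bytes serializes to slightly more than 4KiB. A minimal sketch of the resulting decision, reusing the `FlatStateValue` sketch above (the constant and function names are assumptions):

```rust
/// Threshold from the discussion above: 4000 bytes rather than 4096, leaving
/// headroom for the borsh length prefix and enum tag. The name is illustrative.
const INLINE_DISK_VALUE_THRESHOLD: usize = 4000;

/// Choose the on-disk representation for a value about to be written to flat storage.
fn on_disk_value(raw_value: &[u8], value_ref: ValueRef) -> FlatStateValue {
    if raw_value.len() <= INLINE_DISK_VALUE_THRESHOLD {
        FlatStateValue::Inlined(raw_value.to_vec())
    } else {
        FlatStateValue::Ref(value_ref)
    }
}
```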
Another question: this is only for flat storage access, right?
@jakmeier yes, this only applies to flat storage.
**Migration for existing data**

Currently all FlatState data is stored in rocksdb as `ValueRef`s.

**Challenges**

Data migration requires reading the value for every FlatState entry.

**Migration with rocksdb merge operator**

The initial idea was to implement the existing data migration with a rocksdb merge operator: we simply iterate over all entries in flat storage and issue a merge operation for each of them. Unfortunately this approach doesn't work for deleted values (see the failing test). We could solve that by introducing a special empty value and then filtering it out on reads (similar to how we currently handle refcounting). Those values can then be removed as a second step of the migration. In the next release we can remove the filtering, since we know that all empty values were deleted. In this comment Robin also mentioned other issues with this approach:
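A minimal sketch of the merge semantics this approach implies, written as a plain function rather than against a specific rocksdb binding and reusing the `FlatStateValue` sketch above (all names are assumptions): the existing entry is the borsh-encoded `FlatStateValue`, and each merge operand carries the raw value bytes read from the trie. The `None` case on the first line is exactly the deleted-value problem.

```rust
use borsh::BorshDeserialize;

/// Illustrative merge semantics for the migration; not the actual nearcore code.
fn inline_merge(existing: Option<&[u8]>, operands: &[&[u8]]) -> Option<Vec<u8>> {
    // If the key was deleted between the migration reading it and the merge
    // being applied, `existing` is None. A naive merge operator would have to
    // either resurrect the key or invent a tombstone value; this is the
    // failing deleted-values case described above.
    let mut value = FlatStateValue::try_from_slice(existing?).ok()?;
    for raw_value in operands {
        if let FlatStateValue::Ref(_) = value {
            // Upgrade the reference to an inlined value.
            value = FlatStateValue::Inlined(raw_value.to_vec());
        }
        // If newer (inlining-aware) code already overwrote the entry, keep it.
    }
    borsh::to_vec(&value).ok()
}
```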
**Migration with temporarily pausing FlatState updates**

Another approach is to make sure FlatState is not updated in between reading and writing the value for inlining. This can be achieved by temporarily skipping flat head updates. It is safe to pause flat head updates since we accumulate all changes in deltas, which will be committed eventually. The implementation is as follows:
**Migration with staging column**

As described by Robin in this comment.
Part of #8243. This PR implements value inlining MVP for state sync:
* add `FlatStateValue::Inlined` variant to store inlined values as part of `FlatState` and `FlatStateChanges` on disk.
* change flat storage API to return `FlatStateValue` instead of `ValueRef`.

The following will be implemented separately:
* Migration for existing `FlatState` values. This is required for state sync, but quite involved, so decided to keep it separately.
* Inlining for cached flat state deltas: for now we keep those as `ValueRef`.
* Using inlined values for transaction processing: for now we convert inlined values to `ValueRef`.
I think the merge operator approach you've come up with is very creative. However, once the merge operator is defined it should always stick around... There's no way to "flatten it out" unless you iterate over everything and call "set" to ensure no merges ever exist again.

For the deletion case, I suppose you could solve that by distinguishing between a merge and a set. So a value would be either a ValueRef, a Value, or a ValueMerge. You'd only merge ValueMerge during migrations. The merge operator is defined as: if the base is a ValueRef, then you expect ValueMerges and the logic is what you implemented, except the returned value is a Value. If the base is a Value, that means the new code has already overwritten it, so you just return the same Value. If the base is a ValueMerge, then you know that it was supposed to be deleted, and so you return None. In addition, after reading a value, if the result is a ValueMerge then you have to interpret it as if it were a None.

But the problem is this is hacky and polluting; hacky as in you have to invent a format for serializing Value and ValueMerge so that they can be distinguished from a serialized ValueRef, whose format you can't change. Polluting as in it's hard to get rid of ValueMerge unless you do another data migration. And you have to treat ValueMerge specially.

Anyway, I feel like we are stretching the limits of the merge operator. I think I prefer the alternative design.
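A sketch of the three-way encoding described above (purely illustrative; none of these names exist in nearcore):

```rust
/// Purely illustrative encoding for the workaround above; not nearcore code.
enum MigrationValue {
    /// Original on-disk format: a reference into the trie.
    ValueRef(Vec<u8>),
    /// Final format, written by new, inlining-aware code.
    Value(Vec<u8>),
    /// Written only as merge operands by the migration.
    ValueMerge(Vec<u8>),
}

/// Resolution rules from the comment above: how a base entry combines with a
/// single migration operand.
fn resolve(base: MigrationValue, operand: MigrationValue) -> Option<MigrationValue> {
    match (base, operand) {
        // Base is still a reference: the migration operand upgrades it to a Value.
        (MigrationValue::ValueRef(_), MigrationValue::ValueMerge(bytes)) => {
            Some(MigrationValue::Value(bytes))
        }
        // New code has already overwritten the entry: keep it as is.
        (MigrationValue::Value(bytes), _) => Some(MigrationValue::Value(bytes)),
        // Base is a bare ValueMerge: the key was supposed to be deleted.
        // Readers likewise treat a ValueMerge result as if it were None.
        (MigrationValue::ValueMerge(_), _) => None,
        // Other combinations are not expected during the migration.
        _ => None,
    }
}
```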
Perhaps you could also do something similar to the pause-writes idea, but closer to the db level. Introduce a new column family P where all new flat state updates should be written. Readers should read from P first, and if an entry exists there, return that instead. Run the async migration on the old column family. Once that's done, schedule an offline migration to iterate through P and apply it as updates to the original column family, and then delete P. It's like doing a rebase. Not a well-fleshed-out idea.
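A small self-contained sketch of the read path this implies (the map-based store and the names are stand-ins, not the nearcore storage API):

```rust
use std::collections::HashMap;

/// Stand-ins for the two column families; not the nearcore storage API.
struct FlatStore {
    /// Original FlatState column, rewritten in place by the async migration.
    flat_state: HashMap<Vec<u8>, Vec<u8>>,
    /// New column family "P" that receives all fresh flat state updates.
    staging: HashMap<Vec<u8>, Vec<u8>>,
}

impl FlatStore {
    /// Readers check the staging column first; an entry there shadows whatever
    /// the migration is doing to the old column. (Deletions would additionally
    /// need a tombstone marker in P, which is omitted here.)
    fn get(&self, key: &[u8]) -> Option<&Vec<u8>> {
        self.staging.get(key).or_else(|| self.flat_state.get(key))
    }
}
```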
Part of #8243. This PR implements the migration process for inlining `FlatState` values. For more details see the second approach in [this comment](#8243 (comment)). Migration is not currently executed on the running node; that will be implemented in a separate PR. Instead, a `migrate-value-inlining` sub-command is added as part of the `flat-storage` command. It can be executed via `cargo run --release -p neard -- --verbose store flat-storage migrate-value-inlining`. Progress log example:
```
2023-05-15T14:22:54.280387Z INFO store: Starting FlatState value inlining migration read_state_threads=16 batch_size=50000
...
2023-05-15T16:00:24.210821Z DEBUG store: Processed flat state value inlining batch batch_index=1580 inlined_batch_count=50000 inlined_total_count=35943298 batch_duration=67.303985ms
2023-05-15T16:00:25.388086Z DEBUG store: Processed flat state value inlining batch batch_index=1581 inlined_batch_count=50000 inlined_total_count=35993298 batch_duration=89.054046ms
...
2023-05-15T17:02:14.707594Z INFO store: Finished FlatState value inlining migration inlined_total_count=128780116 migration_elapsed=4388.085856057s
```
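A rough, self-contained sketch of the batching pattern the log output suggests (every name here is a stand-in; the real implementation is in the PR). The `inline_batch` closure stands in for reading the referenced values, on a pool of state-reader threads in the PR, and writing them back inlined:

```rust
/// Illustrative batching pattern; not the actual nearcore code.
fn migrate_value_inlining<I, F>(entries: I, batch_size: usize, mut inline_batch: F)
where
    I: Iterator<Item = (Vec<u8>, Vec<u8>)>,
    F: FnMut(&[(Vec<u8>, Vec<u8>)]),
{
    let mut batch: Vec<(Vec<u8>, Vec<u8>)> = Vec::with_capacity(batch_size);
    let mut inlined_total_count = 0usize;
    let mut batch_index = 0usize;
    for entry in entries {
        batch.push(entry);
        if batch.len() == batch_size {
            // Inline one batch, then log progress as in the output above.
            inline_batch(&batch);
            inlined_total_count += batch.len();
            batch_index += 1;
            println!("batch_index={batch_index} inlined_total_count={inlined_total_count}");
            batch.clear();
        }
    }
    // Flush the final, possibly partial batch.
    if !batch.is_empty() {
        inline_batch(&batch);
        inlined_total_count += batch.len();
    }
    println!("finished, inlined_total_count={inlined_total_count}");
}
```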
…9093) Part of #8243. This PR enables the migration added in #9037 to be executed in the background on the running node. It supports a graceful stop when the node is shut down. The implementation is heavily inspired by the state sync background dumping to S3. This PR also introduces a new column, `DBCol::Misc`. For now it only stores the status of the migration, but it can hold any small pieces of data, similar to `DBCol::BlockMisc`. `FlatStorageManager` is exposed as part of `RuntimeAdapter` in this PR. This is the first step in cleaning up `RuntimeAdapter` by removing all other flat-storage-related methods, as the manager can be used directly instead. Tested by manually running a node and checking metrics and log messages. After that, flat storage was checked with the `flat-storage verify` cmd.
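A minimal sketch of the graceful-stop shape described here (the flag and loop are illustrative; the real code persists its status in `DBCol::Misc` so work can resume on the next start):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

/// Illustrative background-migration shape; not the actual nearcore code.
fn spawn_inlining_migration(keep_running: Arc<AtomicBool>) -> thread::JoinHandle<()> {
    thread::spawn(move || loop {
        // Check between batches so the node can shut down promptly; persisted
        // migration status would let the work resume on the next start.
        if !keep_running.load(Ordering::Relaxed) {
            break;
        }
        // ... process one batch and persist progress ...
    })
}
```

On shutdown the node would flip the flag to `false` and join the handle, so the migration stops at a batch boundary rather than mid-write.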
Status update:
Can we close this issue?
@Longarithm I suggest we keep it open until we implement reading of inlined values from the runtime.
Currently, before reading a value from state - which can be an Account, AccessKey, ContractCode, etc. - we read its hash and length (“ValueRef”), and then we read the value itself. This is done so we can charge “per byte” costs for reading values. It prevents users from attaching a small amount of gas to contract calls, triggering 4MB reads, and paying less gas than the protocol requires.
However, it requires one extra random read. For flat storage this becomes critical, as it would need to make two random reads instead of one. Also, the majority of values read are smaller than 100B as of summer 2022. This means that for small values, we can store the value itself instead of a reference.
This includes at least three parts:
Research - what is the size limit for inlining: 32, 64, 128 bytes? How much does it help - for state-viewer, for a real node?
Design - how do we store it? Start with a byte determining the value type, followed by either the ValueRef or the value itself? (See the sketch after this list.)
Migration - how do we migrate to the new format seamlessly, given that it takes 5 hours to traverse the whole testnet state for an RPC node in parallel?
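On the "Design" point, a hedged sketch of the "tag byte, then ValueRef or value" layout, reusing the `FlatStateValue` and `ValueRef` sketches from earlier in this thread. This is essentially what a borsh-encoded two-variant enum produces; the exact field order is an assumption:

```rust
/// Illustrative on-disk layout: one tag byte, then either the reference
/// (length + hash) or the raw value with a length prefix; not nearcore code.
fn encode(value: &FlatStateValue) -> Vec<u8> {
    match value {
        FlatStateValue::Ref(value_ref) => {
            let mut out = vec![0u8]; // tag 0: reference
            out.extend_from_slice(&value_ref.length.to_le_bytes());
            out.extend_from_slice(&value_ref.hash);
            out
        }
        FlatStateValue::Inlined(bytes) => {
            let mut out = vec![1u8]; // tag 1: inlined value
            out.extend_from_slice(&(bytes.len() as u32).to_le_bytes());
            out.extend_from_slice(bytes);
            out
        }
    }
}
```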