
Implement inlining small values for FlatState #8243

Closed · exalate-issue-sync bot opened this issue Dec 19, 2022 · 11 comments

Labels: A-storage Area: storage and databases · T-core Team: issues relevant to the core team

exalate-issue-sync bot commented Dec 19, 2022

Currently, before reading a value from state - which can be an Account, AccessKey, ContractCode, etc. - we first read its hash and length (a “ValueRef”), and only then read the value itself. This is done so we can charge “per byte” costs for reading values: it prevents users from attaching too little gas to a contract call, triggering a 4MB read, and paying less gas than the protocol requires.

However, this requires one extra random read. For flat storage this becomes critical, as it would need two random reads instead of one. Also, as of summer 2022 the majority of values read are shorter than 100 bytes. This means that for small values we can store the value itself instead of a reference.
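
For reference, a simplified sketch of what such a value reference looks like (the actual `ValueRef` definition in nearcore may differ in details such as field order and derives):

```rust
/// 32-byte hash, standing in for nearcore's `CryptoHash`.
type CryptoHash = [u8; 32];

/// Simplified sketch of the reference stored per key today:
/// the value's hash (its key in the State column) plus its length,
/// which is what the "per byte" read costs are charged against.
struct ValueRef {
    hash: CryptoHash,
    length: u32,
}
```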

This work includes at least three parts:

Research: what is the size limit for inlining - 32, 64, 128 bytes? How much does it help - for state-viewer, for a real node?

Design: how do we store it? Start with a byte determining the value type, followed by either a ValueRef or the value itself?

Migration: how do we migrate to the new format seamlessly, given that it takes ~5 hours to traverse the whole testnet state in parallel on an RPC node?

exalate-issue-sync bot changed the title from "Implement in-lining small values" to "Implement in-lining small values for State" on Dec 19, 2022
Longarithm mentioned this issue Dec 23, 2022
walnut-the-cat added the A-storage (Area: storage and databases) and T-core (Team: issues relevant to the core team) labels and removed the T-storage label on Mar 13, 2023
near-bulldozer bot pushed a commit that referenced this issue Mar 15, 2023
…8729)

We plan to introduce small value inlining as a next step in Flat Storage; see #8243 for more details. This is not part of the MVP, which means it will not be included in the first Flat Storage release.
This PR wraps `ValueRef` in a `FlatStateValue` enum, so later we can simply add another variant there for the inlined value without changing the on-disk format. This significantly simplifies the future migration.
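
A minimal sketch of such an enum, reusing the `ValueRef` sketch above (the `Inlined` variant is the future addition; the actual nearcore definition may differ):

```rust
/// Sketch: wrapping `ValueRef` in an enum writes the enum discriminant byte
/// to disk from the start, so an `Inlined` variant can be added later without
/// changing the on-disk format of existing `Ref` entries.
enum FlatStateValue {
    /// Reference into the State column (hash + length), as before.
    Ref(ValueRef),
    /// Small value stored directly in FlatState (added in a later PR).
    Inlined(Vec<u8>),
}
```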
pugachAG self-assigned this Apr 24, 2023
pugachAG (Contributor) commented Apr 26, 2023

On-disk inlining size threshold

General considerations

We want to inline as many flat storage values as possible to make both runtime and state sync faster.

Inlining is triggered when the size of the value is less than or equal to (≤) a certain threshold.

Setting the threshold too high results in the following security issue:
A contract can try reading a large value without attaching sufficient gas. This makes us read large amounts of data from disk without being compensated for it.
Considering that, it makes sense to use the SSD page size, which is usually 4KB, as an upper bound. This way we are bounded by 1 random disk read.

Another consideration is the space increase caused by inlining large values.

Mainnet data

Some insights into our data across all shards:

  • Total key-value pairs: 171M
  • Total flat storage size (without inlining): 17.7GB
  • Total size of all values: 15.9GB
  • Total size of all value refs: 6.3GB

Percentiles of value sizes (bytes):

p10: 8
p50: 32
p75: 72
p90: 100
p95: 149
p99: 192
p99.5: 280
p99.9: 6149
p99.99: 213058
p99.9999: 735963

Inlined-values ratio and space overhead for some thresholds (limit); note that the space figures are relative to storing refs:

limit 32:   inlined 0.70560, space: 0.7227654213530057
limit 64:   inlined 0.74369, space: 0.7387654026707645
limit 128:  inlined 0.9049,  space: 0.9217107876112891
limit 256:  inlined 0.9927,  space: 1.2077116434948447
limit 512:  inlined 0.9971,  space: 1.2402383202118468
limit 1024: inlined 0.9977,  space: 1.2523669464822256
limit 2048: inlined 0.9980,  space: 1.263055815545039
limit 4096: inlined 0.99814, space: 1.2677205018776814

Conclusion

Setting the threshold to 4KiB results in inlining 99.8% of values with ~27% space overhead (relative to storing refs), which is acceptable.

Update: @jakmeier pointed out that we have some value encoding overhead, so let's set the limit to 4000 bytes instead of 4KiB to be on the safe side.
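
As a rough sketch of how the write path could choose between the two variants under this threshold (names here are illustrative, not the actual nearcore API; `FlatStateValue` and `ValueRef` as in the sketches above):

```rust
/// Threshold from the analysis above: values up to 4000 bytes are inlined,
/// leaving headroom under the 4 KiB SSD page for encoding overhead.
const INLINE_VALUE_THRESHOLD: usize = 4000;

/// Hypothetical stand-in for the hashing used to build a `ValueRef`.
fn hash_bytes(_bytes: &[u8]) -> CryptoHash {
    unimplemented!("stand-in for the real value hashing in nearcore")
}

/// Sketch of choosing the on-disk representation for a flat storage write.
fn flat_state_value_for(value: &[u8]) -> FlatStateValue {
    if value.len() <= INLINE_VALUE_THRESHOLD {
        FlatStateValue::Inlined(value.to_vec())
    } else {
        FlatStateValue::Ref(ValueRef {
            hash: hash_bytes(value),
            length: value.len() as u32,
        })
    }
}
```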

jakmeier (Contributor) commented May 1, 2023

Great summary and promising results!

One small nit/question. I see you tested with 4096B. I assume that's the size of the value, not its encoding? If so, would it make sense to make the value limit slightly less than 4KiB to account for the borsh encoding overhead? For example, 4KB (4000B) rather than 4096B? (Borsh uses a u8 for the enum discriminant and a u32 for the Vec length.) Values of size 4097B can be nastily inefficient to store...
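
For concreteness, with such an encoding an inlined value takes roughly 1 byte (enum discriminant) + 4 bytes (Vec length) + the value itself, so a 4096-byte value encodes to 4101 bytes and spills past a 4 KiB page. A quick check of that arithmetic, assuming borsh 1.x and an enum shaped like the sketch above:

```rust
use borsh::BorshSerialize;

#[derive(BorshSerialize)]
enum FlatStateValueSketch {
    Inlined(Vec<u8>),
}

fn main() {
    // u8 discriminant + u32 Vec length + payload.
    let encoded = borsh::to_vec(&FlatStateValueSketch::Inlined(vec![0u8; 4096])).unwrap();
    assert_eq!(encoded.len(), 1 + 4 + 4096); // 4101 bytes, i.e. just over 4 KiB
}
```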

pugachAG (Contributor) commented May 1, 2023

Thanks @jakmeier, that is a very good point! I agree, let's use 4000 bytes instead; I've updated my comment as well.

jakmeier (Contributor) commented May 1, 2023

Another question: this is only for flat storage access, right? In other words, the State column is not affected? (Otherwise I'd be slightly worried about the read amplification caused by inlining values into the ValueRef inside branches, as well as the additional memory consumption in trie node caches and RocksDB caches.)

pugachAG (Contributor) commented May 1, 2023

@jakmeier yes, this only applies to flat storage, i.e. the FlatState column.

pugachAG changed the title from "Implement in-lining small values for State" to "Implement inlining small values for FlatState" on May 3, 2023
pugachAG (Contributor) commented May 3, 2023

Migration for existing data

Currently all FlatState data is stored in RocksDB as FlatStateValue::Ref. New data will be inlined on update, but we still need to migrate all other entries, so we need to implement some kind of data migration for that.

Challenges

Data migration requires reading the value behind every ValueRef in FlatState from the State column, so each of those is a random disk read. This means the migration will take roughly the same time as flat storage creation, so it has to be async. Performing an async migration means the FlatState column will be updated while we migrate existing records, so we need to be careful not to overwrite updated or deleted values.

Migration with rocksdb merge operator

The initial idea was to implement the existing-data migration with the RocksDB merge operator. We simply iterate over all entries in flat storage and issue a merge operation for values that need to be inlined. Our merge operator implementation then takes care of checking that the existing value still corresponds to the value we are trying to inline, see the implementation here.

Unfortunately this approach doesn't work for deleted values, see the failing inlined_after_delete test. RocksDB doesn't allow a merge to preserve a deletion, so the merge operator must always produce an existing record.

We could solve that by introducing a special empty value and then filtering it out on reads (similar to how we currently handle refcounting). Those values can then be removed as a second step of the migration. In the next release we can remove the filtering, since we know that all empty values have been deleted.
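
A simplified sketch of the intended merge semantics, in plain Rust and ignoring the actual RocksDB merge-operator plumbing (`FlatStateValue` and the hypothetical `hash_bytes` helper are from the sketches above):

```rust
/// `existing` is the current FlatState entry (None if the key was deleted),
/// `inlined` is the full value the migration wants to inline for that key.
fn merge_inline(existing: Option<FlatStateValue>, inlined: Vec<u8>) -> Option<FlatStateValue> {
    match existing {
        // Still the same reference we resolved: safe to inline.
        Some(FlatStateValue::Ref(r)) if r.hash == hash_bytes(&inlined) => {
            Some(FlatStateValue::Inlined(inlined))
        }
        // Already updated since the migration read it: keep the newer value.
        other @ Some(_) => other,
        // Deleted concurrently: we would like to return None, but a RocksDB
        // merge must produce a record, which is exactly the problem described
        // above for deleted values.
        None => None,
    }
}
```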

In this comment Robin also mentioned another issue with this approach:

once the merge operator is defined it should always stick around... There's no way to "flatten it out" unless you iterate over everything and call "set" to ensure no merges ever exist again.

Migration with temporary pausing FlatState update

Another approach is to make sure FlatState is not updated between reading and writing a value for inlining. This can be achieved by temporarily skipping flat head updates. Pausing flat head updates is safe because we accumulate all changes in deltas, which will be committed eventually.

The implementation is as follows (a sketch in code appears after the list):

  1. The background migration creates a FlatState iterator and reads the corresponding values for inlining in batches. This iterator might contain outdated entries as flat head moves, and that is fine.
  2. Once we have resolved a batch of values, we pause flat head updates and read the records in that range from the current FlatState, then issue writes for the values that need to be inlined. This ensures that we won't update any values that have changed since we created the iterator in step 1.
  3. We unpause flat head updates and read the next batch of values with the iterator from step 1. The flat head will have enough time to catch up while we resolve values from State, which is the bottleneck of the migration.
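
A sketch of this loop in code, using a hypothetical store interface (none of these names are the real nearcore API; `FlatStateValue`, `ValueRef` and `hash_bytes` are from the sketches above):

```rust
/// Hypothetical interface over the pieces of the store the migration needs.
trait FlatMigrationStore {
    /// Next batch of (key, ValueRef) entries from an iterator created once at
    /// the start of the migration; entries may be stale as flat head moves.
    fn next_ref_batch(&mut self, batch_size: usize) -> Vec<(Vec<u8>, ValueRef)>;
    /// Read the full value behind a ValueRef from the State column.
    fn read_state_value(&self, value_ref: &ValueRef) -> Vec<u8>;
    /// Current FlatState entry for a key, if any.
    fn get_flat_state(&self, key: &[u8]) -> Option<FlatStateValue>;
    /// Overwrite a FlatState entry with an inlined value.
    fn set_flat_state(&mut self, key: Vec<u8>, value: FlatStateValue);
    /// Pause or resume flat head updates.
    fn set_flat_head_updates_paused(&mut self, paused: bool);
}

/// Sketch of the pause-based migration loop described in the comment above.
fn migrate_value_inlining(store: &mut impl FlatMigrationStore, batch_size: usize) {
    loop {
        // Step 1: take a batch from the long-lived iterator and resolve the
        // full values from State (the random-read bottleneck) while flat head
        // keeps moving.
        let batch = store.next_ref_batch(batch_size);
        if batch.is_empty() {
            break;
        }
        let resolved: Vec<(Vec<u8>, Vec<u8>)> = batch
            .iter()
            .map(|(key, r)| (key.clone(), store.read_state_value(r)))
            .collect();

        // Step 2: pause flat head updates, then only overwrite entries that
        // still reference exactly the value we resolved.
        store.set_flat_head_updates_paused(true);
        for (key, value) in resolved {
            if let Some(FlatStateValue::Ref(current)) = store.get_flat_state(&key) {
                if current.hash == hash_bytes(&value) {
                    store.set_flat_state(key, FlatStateValue::Inlined(value));
                }
            }
        }
        // Step 3: resume flat head updates; the flat head catches up while the
        // next batch is being resolved from State.
        store.set_flat_head_updates_paused(false);
    }
}
```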

Migration with staging column

As described by Robin in this comment

near-bulldozer bot pushed a commit that referenced this issue May 3, 2023
Part of #8243.

This PR implements value inlining MVP for state sync:
* add `FlatStateValue::Inlined` variant to store inlined values as part of `FlatState` and `FlatStateChanges` on disk.
* change flat storage API to return `FlatStateValue` instead of `ValueRef`.

The following will be implemented separately:
* Migration for existing `FlatState` values. This is required for state sync, but quite involved, so we decided to keep it separate.
* Inlining for cached flat state deltas: for now we keep those as `ValueRef`.
* Using inlined values for transaction processing: for now we convert inlined values to `ValueRef`.
robin-near (Contributor) commented:

I think the merge operator approach you've come up with is very creative. However, once the merge operator is defined it should always stick around... There's no way to "flatten it out" unless you iterate over everything and call "set" to ensure no merges ever exist again.

For the deletion case, I suppose you could solve that by distinguishing between a merge and a set. So a value would be either a ValueRef, a Value, or a ValueMerge. You'd only merge ValueMerge during migrations. The merge operator is defined as: if the base is a ValueRef, then you expect ValueMerges and the logic is what you implemented, except the returned value is a Value. If the base is a Value, that means the new code has already overwritten it, so you just return the same Value. If the base is a ValueMerge, then you know that it was supposed to be deleted, and so you return None. In addition, if after reading the value the result is a ValueMerge, then you have to interpret it as if it were a None.

But the problem is this is hacky and polluting; hacky as in you have to invent a format for serializing Value and ValueMerge so that they can be distinguished from a serialized ValueRef whose format you can't change. Polluting as in it's hard to get rid of ValueMerge unless you do another data migration. And you have to treat ValueMerge specially.

Anyway, I feel like we are stretching the limitations of the merge operator. I think I prefer the alternative design.

robin-near (Contributor) commented:

Perhaps you could also do something similar to the pause-writes idea, but closer to the db level. Introduce a new column family P to which all new flat state updates are written. Readers should read from P first and, if an entry exists, return that instead. Run the async migration on the old column family. Once that's done, schedule an offline migration to iterate through P and apply it as updates to the original column family, and delete P. It's like doing a rebase. Not a well fleshed out idea.
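
A tiny sketch of the read path under this idea, with made-up names (the staging column plays the role of P; `FlatStateValue` as in the sketches above):

```rust
/// Illustrative store interface for the staging-column idea; not nearcore API.
trait StagedFlatStore {
    /// Lookup in the staging column P where all new flat state updates go.
    fn get_staging(&self, key: &[u8]) -> Option<FlatStateValue>;
    /// Lookup in the old FlatState column being migrated in the background.
    fn get_old(&self, key: &[u8]) -> Option<FlatStateValue>;
}

/// Readers consult P first: anything there is newer than whatever the async
/// migration is rewriting in the old column. (Deletions would additionally
/// need a tombstone marker in P, not shown here.)
fn read_flat_state(store: &impl StagedFlatStore, key: &[u8]) -> Option<FlatStateValue> {
    store.get_staging(key).or_else(|| store.get_old(key))
}
```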

near-bulldozer bot pushed a commit that referenced this issue May 8, 2023
Part of #8243.

This PR consists of 2 parts:
* 08ecc51: un-revert #9005, see #8988 for the detailed description of the changes.
* 3d29347: db migration for flat state delta changes to avoid backward compatibility issues
near-bulldozer bot pushed a commit that referenced this issue May 19, 2023
Part of #8243.

This PR implements the migration process for inlining `FlatState` values. For more details see the second approach in [this comment](#8243 (comment)).
The migration is not currently executed on the running node; that will be implemented in a separate PR. Instead, a `migrate-value-inlining` sub-command is added as part of the `flat-storage` command.


Can be executed via `cargo run --release -p neard -- --verbose store flat-storage migrate-value-inlining`. Progress log example:
```
2023-05-15T14:22:54.280387Z  INFO store: Starting FlatState value inlining migration read_state_threads=16 batch_size=50000
...
2023-05-15T16:00:24.210821Z DEBUG store: Processed flat state value inlining batch batch_index=1580 inlined_batch_count=50000 inlined_total_count=35943298 batch_duration=67.303985ms
2023-05-15T16:00:25.388086Z DEBUG store: Processed flat state value inlining batch batch_index=1581 inlined_batch_count=50000 inlined_total_count=35993298 batch_duration=89.054046ms
...
2023-05-15T17:02:14.707594Z  INFO store: Finished FlatState value inlining migration inlined_total_count=128780116 migration_elapsed=4388.085856057s
```
near-bulldozer bot pushed a commit that referenced this issue May 25, 2023
…9093)

Part of #8243.

This PR enables the migration added in #9037 to be executed in the background on the running node.
It supports graceful stop when the node is shut down. The implementation is heavily inspired by state sync background dumping to S3.

This PR also introduces a new column `DBCol::Misc`. For now it only stores the status of the migration, but it can hold any small pieces of data, similar to `DBCol::BlockMisc`.

`FlatStorageManager` is exposed as part of `RuntimeAdapter` in this PR. This is the first step in removing all other flat-storage-related methods from `RuntimeAdapter`, as the manager can be used directly instead.

Tested by manually running a node and checking metrics and log messages. After that, flat storage was checked with the `flat-storage verify` command.
pugachAG (Contributor) commented Jun 6, 2023

Status update:

  • Inlining is implemented as part of flat storage.
  • Migration for existing data in FlatState is implemented.
  • Inlined values are not used when processing receipts; those are converted to ValueRefs instead (code; a sketch of this conversion follows below). Also, inlined values are not used for in-memory deltas. The plan is to have a different inlining threshold for cached deltas, but that would result in a different chunk cache state depending on whether a block is final or not. That in turn results in different gas costs, which is an issue. We decided that solving this is not the highest priority right now, since the actual bottleneck is write performance.
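
For context, the conversion mentioned in the last point is roughly the following (a sketch reusing the earlier definitions, not the actual nearcore code):

```rust
/// Convert a flat storage value back to a ValueRef for the receipt-processing
/// path: a Ref is passed through, an inlined value is re-hashed and measured.
fn to_value_ref(value: FlatStateValue) -> ValueRef {
    match value {
        FlatStateValue::Ref(r) => r,
        FlatStateValue::Inlined(bytes) => ValueRef {
            hash: hash_bytes(&bytes),
            length: bytes.len() as u32,
        },
    }
}
```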

Longarithm (Member) commented:

Can we close this issue?

pugachAG (Contributor) commented:

@Longarithm I suggest we keep it open until we implement reading of inlined values from the runtime.

pugachAG closed this as completed Feb 2, 2024