
Implement inlining small values for FlatState #8243

Closed · exalate-issue-sync bot opened this issue Dec 19, 2022 · 11 comments

Labels: A-storage Area: storage and databases · T-core Team: issues relevant to the core team

exalate-issue-sync bot commented Dec 19, 2022

Currently, before reading a value from state - which can be an Account, AccessKey, ContractCode, etc. - we first read its hash and length (a “ValueRef”), and only then read the value itself. This is done so we can charge “per byte” costs for reading values: it prevents users from attaching too little gas to a contract call, triggering a 4MB read, and paying less gas than the protocol requires.

However, this requires one extra random read. For flat storage this becomes critical, as it would need two random reads instead of one. Also, as of summer 2022 the majority of values read are shorter than 100 bytes. This means that for small values we can store the value itself instead of a reference.
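
For reference, a simplified sketch of what such a value reference looks like (the actual `ValueRef` definition in nearcore may differ in details such as field order and derives):

```rust
/// 32-byte hash, standing in for nearcore's `CryptoHash`.
type CryptoHash = [u8; 32];

/// Simplified sketch of the reference stored per key today:
/// the value's hash (its key in the State column) plus its length,
/// which is what the "per byte" read costs are charged against.
struct ValueRef {
    hash: CryptoHash,
    length: u32,
}
```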

This work includes at least three parts:

Research: what is the size limit for inlining - 32, 64, 128 bytes? How much does it help - for state-viewer, for a real node?

Design: how do we store it? Start with a byte determining the value type, followed by either a ValueRef or the value itself?

Migration: how do we migrate to the new format seamlessly, given that it takes ~5 hours to traverse the whole testnet state in parallel on an RPC node?

exalate-issue-sync bot changed the title from "Implement in-lining small values" to "Implement in-lining small values for State" on Dec 19, 2022
Longarithm mentioned this issue Dec 23, 2022
walnut-the-cat added the A-storage (Area: storage and databases) and T-core (Team: issues relevant to the core team) labels and removed the T-storage label on Mar 13, 2023
near-bulldozer bot pushed a commit that referenced this issue Mar 15, 2023
…8729)

We plan to introduce small value inlining as a next step in Flat Storage; see #8243 for more details. This is not part of the MVP, which means it will not be included in the first Flat Storage release.
This PR wraps `ValueRef` in a `FlatStateValue` enum, so later we can simply add another variant there for the inlined value without changing the on-disk format. This significantly simplifies the future migration.
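
A minimal sketch of such an enum, reusing the `ValueRef` sketch above (the `Inlined` variant is the future addition; the actual nearcore definition may differ):

```rust
/// Sketch: wrapping `ValueRef` in an enum writes the enum discriminant byte
/// to disk from the start, so an `Inlined` variant can be added later without
/// changing the on-disk format of existing `Ref` entries.
enum FlatStateValue {
    /// Reference into the State column (hash + length), as before.
    Ref(ValueRef),
    /// Small value stored directly in FlatState (added in a later PR).
    Inlined(Vec<u8>),
}
```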
pugachAG self-assigned this Apr 24, 2023
pugachAG (Contributor) commented Apr 26, 2023

On-disk inlining size threshold

General considerations

We want to inline as many flat storage values as possible to make both runtime and state sync faster.

Inlining is triggered when the size of the value is less than or equal to (≤) a certain threshold.

Setting the threshold too high results in the following security issue:
A contract can try reading a large value without attaching sufficient gas. This makes us read large amounts of data from disk without being compensated for it.
Considering that, it makes sense to use the SSD page size, which is usually 4KB, as an upper bound. This way we are bounded by 1 random disk read.

Another consideration is the space increase caused by inlining large values.

Mainnet data

Some insights into our data across all shards:

  • Total key-value pairs: 171M
  • Total flat storage size (without inlining): 17.7GB
  • Total size of all values: 15.9GB
  • Total size of all value refs: 6.3GB

Percentiles of value sizes (bytes):

p10: 8
p50: 32
p75: 72
p90: 100
p95: 149
p99: 192
p99.5: 280
p99.9: 6149
p99.99: 213058
p99.9999: 735963

Inlined-values ratio and space overhead for some thresholds (limit); note that the space figures are relative to storing refs:

limit 32:   inlined 0.70560, space: 0.7227654213530057
limit 64:   inlined 0.74369, space: 0.7387654026707645
limit 128:  inlined 0.9049,  space: 0.9217107876112891
limit 256:  inlined 0.9927,  space: 1.2077116434948447
limit 512:  inlined 0.9971,  space: 1.2402383202118468
limit 1024: inlined 0.9977,  space: 1.2523669464822256
limit 2048: inlined 0.9980,  space: 1.263055815545039
limit 4096: inlined 0.99814, space: 1.2677205018776814

Conclusion

Setting the threshold to 4KiB results in inlining 99.8% of values with ~27% space overhead (relative to storing refs), which is acceptable.

Update: @jakmeier pointed out that we have some value encoding overhead, so let's set the limit to 4000 bytes instead of 4KiB to be on the safe side.
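
As a rough sketch of how the write path could choose between the two variants under this threshold (names here are illustrative, not the actual nearcore API; `FlatStateValue` and `ValueRef` as in the sketches above):

```rust
/// Threshold from the analysis above: values up to 4000 bytes are inlined,
/// leaving headroom under the 4 KiB SSD page for encoding overhead.
const INLINE_VALUE_THRESHOLD: usize = 4000;

/// Hypothetical stand-in for the hashing used to build a `ValueRef`.
fn hash_bytes(_bytes: &[u8]) -> CryptoHash {
    unimplemented!("stand-in for the real value hashing in nearcore")
}

/// Sketch of choosing the on-disk representation for a flat storage write.
fn flat_state_value_for(value: &[u8]) -> FlatStateValue {
    if value.len() <= INLINE_VALUE_THRESHOLD {
        FlatStateValue::Inlined(value.to_vec())
    } else {
        FlatStateValue::Ref(ValueRef {
            hash: hash_bytes(value),
            length: value.len() as u32,
        })
    }
}
```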

jakmeier (Contributor) commented May 1, 2023

Great summary and promising results!

One small nit/question. I see you tested with 4096B. I assume that's the size of the value, not its encoding? If so, would it make sense to make the value limit slightly less than 4KiB to account for the borsh encoding overhead? For example, 4KB (4000B) rather than 4096B? (Borsh uses a u8 for the enum discriminant and a u32 for the Vec length.) Values of size 4097B can be nastily inefficient to store...
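
For concreteness, with such an encoding an inlined value takes roughly 1 byte (enum discriminant) + 4 bytes (Vec length) + the value itself, so a 4096-byte value encodes to 4101 bytes and spills past a 4 KiB page. A quick check of that arithmetic, assuming borsh 1.x and an enum shaped like the sketch above:

```rust
use borsh::BorshSerialize;

#[derive(BorshSerialize)]
enum FlatStateValueSketch {
    Inlined(Vec<u8>),
}

fn main() {
    // u8 discriminant + u32 Vec length + payload.
    let encoded = borsh::to_vec(&FlatStateValueSketch::Inlined(vec![0u8; 4096])).unwrap();
    assert_eq!(encoded.len(), 1 + 4 + 4096); // 4101 bytes, i.e. just over 4 KiB
}
```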

pugachAG (Contributor) commented May 1, 2023

Thanks @jakmeier, that is a very good point! I agree, let's use 4000 bytes instead; I've updated my comment as well.

jakmeier (Contributor) commented May 1, 2023

Another question: this is only for flat storage access, right? In other words, the State column is not affected? (Otherwise I'd be slightly worried about the read amplification caused by inlining values into the ValueRef inside branches, as well as the additional memory consumption in trie node caches and RocksDB caches.)

pugachAG (Contributor) commented May 1, 2023

@jakmeier yes, this only applies to flat storage, i.e. the FlatState column.

pugachAG changed the title from "Implement in-lining small values for State" to "Implement inlining small values for FlatState" on May 3, 2023
pugachAG (Contributor) commented May 3, 2023

Migration for existing data

Currently all FlatState data is stored in RocksDB as FlatStateValue::Ref. New data will be inlined on update, but we still need to migrate all other entries, so we need to implement some kind of data migration for that.

Challenges

Data migration requires reading the value behind every ValueRef in FlatState from the State column, so each of those is a random disk read. This means the migration will take roughly the same time as flat storage creation, so it has to be async. Performing an async migration means the FlatState column will be updated while we migrate existing records, so we need to be careful not to overwrite updated or deleted values.

Migration with rocksdb merge operator

The initial idea was to implement the existing-data migration with the RocksDB merge operator. We simply iterate over all entries in flat storage and issue a merge operation for values that need to be inlined. Our merge operator implementation then takes care of checking that the existing value still corresponds to the value we are trying to inline, see the implementation here.

Unfortunately this approach doesn't work for deleted values, see the failing inlined_after_delete test. RocksDB doesn't allow a merge to preserve a deletion, so the merge operator must always produce an existing record.

We could solve that by introducing a special empty value and then filtering it out on reads (similar to how we currently handle refcounting). Those values can then be removed as a second step of the migration. In the next release we can remove the filtering, since we know that all empty values have been deleted.
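
A simplified sketch of the intended merge semantics, in plain Rust and ignoring the actual RocksDB merge-operator plumbing (`FlatStateValue` and the hypothetical `hash_bytes` helper are from the sketches above):

```rust
/// `existing` is the current FlatState entry (None if the key was deleted),
/// `inlined` is the full value the migration wants to inline for that key.
fn merge_inline(existing: Option<FlatStateValue>, inlined: Vec<u8>) -> Option<FlatStateValue> {
    match existing {
        // Still the same reference we resolved: safe to inline.
        Some(FlatStateValue::Ref(r)) if r.hash == hash_bytes(&inlined) => {
            Some(FlatStateValue::Inlined(inlined))
        }
        // Already updated since the migration read it: keep the newer value.
        other @ Some(_) => other,
        // Deleted concurrently: we would like to return None, but a RocksDB
        // merge must produce a record, which is exactly the problem described
        // above for deleted values.
        None => None,
    }
}
```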

In this comment Robin also mentioned another issue with this approach:

once the merge operator is defined it should always stick around... There's no way to "flatten it out" unless you iterate over everything and call "set" to ensure no merges ever exist again.

Migration with temporary pausing FlatState update

Another approach is to make sure FlatState is not updated between reading and writing a value for inlining. This can be achieved by temporarily skipping flat head updates. Pausing flat head updates is safe because we accumulate all changes in deltas, which will be committed eventually.

The implementation is as follows (a sketch in code appears after the list):

  1. The background migration creates a FlatState iterator and reads the corresponding values for inlining in batches. This iterator might contain outdated entries as flat head moves, and that is fine.
  2. Once we have resolved a batch of values, we pause flat head updates and read the records in that range from the current FlatState, then issue writes for the values that need to be inlined. This ensures that we won't update any values that have changed since we created the iterator in step 1.
  3. We unpause flat head updates and read the next batch of values with the iterator from step 1. The flat head will have enough time to catch up while we resolve values from State, which is the bottleneck of the migration.
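
A sketch of this loop in code, using a hypothetical store interface (none of these names are the real nearcore API; `FlatStateValue`, `ValueRef` and `hash_bytes` are from the sketches above):

```rust
/// Hypothetical interface over the pieces of the store the migration needs.
trait FlatMigrationStore {
    /// Next batch of (key, ValueRef) entries from an iterator created once at
    /// the start of the migration; entries may be stale as flat head moves.
    fn next_ref_batch(&mut self, batch_size: usize) -> Vec<(Vec<u8>, ValueRef)>;
    /// Read the full value behind a ValueRef from the State column.
    fn read_state_value(&self, value_ref: &ValueRef) -> Vec<u8>;
    /// Current FlatState entry for a key, if any.
    fn get_flat_state(&self, key: &[u8]) -> Option<FlatStateValue>;
    /// Overwrite a FlatState entry with an inlined value.
    fn set_flat_state(&mut self, key: Vec<u8>, value: FlatStateValue);
    /// Pause or resume flat head updates.
    fn set_flat_head_updates_paused(&mut self, paused: bool);
}

/// Sketch of the pause-based migration loop described in the comment above.
fn migrate_value_inlining(store: &mut impl FlatMigrationStore, batch_size: usize) {
    loop {
        // Step 1: take a batch from the long-lived iterator and resolve the
        // full values from State (the random-read bottleneck) while flat head
        // keeps moving.
        let batch = store.next_ref_batch(batch_size);
        if batch.is_empty() {
            break;
        }
        let resolved: Vec<(Vec<u8>, Vec<u8>)> = batch
            .iter()
            .map(|(key, r)| (key.clone(), store.read_state_value(r)))
            .collect();

        // Step 2: pause flat head updates, then only overwrite entries that
        // still reference exactly the value we resolved.
        store.set_flat_head_updates_paused(true);
        for (key, value) in resolved {
            if let Some(FlatStateValue::Ref(current)) = store.get_flat_state(&key) {
                if current.hash == hash_bytes(&value) {
                    store.set_flat_state(key, FlatStateValue::Inlined(value));
                }
            }
        }
        // Step 3: resume flat head updates; the flat head catches up while the
        // next batch is being resolved from State.
        store.set_flat_head_updates_paused(false);
    }
}
```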

Migration with staging column

As described by Robin in this comment

near-bulldozer bot pushed a commit that referenced this issue May 3, 2023
Part of #8243.

This PR implements value inlining MVP for state sync:
* add `FlatStateValue::Inlined` variant to store inlined values as part of `FlatState` and `FlatStateChanges` on disk.
* change flat storage API to return `FlatStateValue` instead of `ValueRef`.

The following will be implemented separately:
* Migration for existing `FlatState` values. This is required for state sync, but quite involved, so we decided to keep it separate.
* Inlining for cached flat state deltas: for now we keep those as `ValueRef`.
* Using inlined values for transaction processing: for now we convert inlined values to `ValueRef`.
robin-near (Contributor) commented:

I think the merge operator approach you've come up with is very creative. However, once the merge operator is defined it should always stick around... There's no way to "flatten it out" unless you iterate over everything and call "set" to ensure no merges ever exist again.

For the deletion case, I suppose you could solve that by distinguishing between a merge and a set. So a value would be either a ValueRef, a Value, or a ValueMerge. You'd only merge ValueMerge during migrations. The merge operator is defined as: if the base is a ValueRef, then you expect ValueMerges and the logic is what you implemented, except the returned value is a Value. If the base is a Value, that means the new code has already overwritten it, so you just return the same Value. If the base is a ValueMerge, then you know that it was supposed to be deleted, and so you return None. In addition, if after reading the value the result is a ValueMerge, then you have to interpret it as if it were a None.

But the problem is this is hacky and polluting; hacky as in you have to invent a format for serializing Value and ValueMerge so that they can be distinguished from a serialized ValueRef whose format you can't change. Polluting as in it's hard to get rid of ValueMerge unless you do another data migration. And you have to treat ValueMerge specially.

Anyway, I feel like we are stretching the limitations of the merge operator. I think I prefer the alternative design.

robin-near (Contributor) commented:

Perhaps you could also do something similar to the pause-writes idea, but closer to the db level. Introduce a new column family P to which all new flat state updates are written. Readers should read from P first and, if an entry exists, return that instead. Run the async migration on the old column family. Once that's done, schedule an offline migration to iterate through P and apply it as updates to the original column family, and delete P. It's like doing a rebase. Not a well fleshed out idea.
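
A tiny sketch of the read path under this idea, with made-up names (the staging column plays the role of P; `FlatStateValue` as in the sketches above):

```rust
/// Illustrative store interface for the staging-column idea; not nearcore API.
trait StagedFlatStore {
    /// Lookup in the staging column P where all new flat state updates go.
    fn get_staging(&self, key: &[u8]) -> Option<FlatStateValue>;
    /// Lookup in the old FlatState column being migrated in the background.
    fn get_old(&self, key: &[u8]) -> Option<FlatStateValue>;
}

/// Readers consult P first: anything there is newer than whatever the async
/// migration is rewriting in the old column. (Deletions would additionally
/// need a tombstone marker in P, not shown here.)
fn read_flat_state(store: &impl StagedFlatStore, key: &[u8]) -> Option<FlatStateValue> {
    store.get_staging(key).or_else(|| store.get_old(key))
}
```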

near-bulldozer bot pushed a commit that referenced this issue May 8, 2023
Part of #8243.

This PR consists of 2 parts:
* 08ecc51: un-revert #9005, see #8988 for the detailed description of the changes.
* 3d29347: db migration for flat state delta changes to avoid backward compatibility issues
near-bulldozer bot pushed a commit that referenced this issue May 19, 2023
Part of #8243.

This PR implements the migration process for inlining `FlatState` values. For more details see the second approach in [this comment](#8243 (comment)).
The migration is not currently executed on the running node; that will be implemented in a separate PR. Instead, a `migrate-value-inlining` sub-command is added as part of the `flat-storage` command.


Can be executed via `cargo run --release -p neard -- --verbose store flat-storage migrate-value-inlining`. Progress log example:
```
2023-05-15T14:22:54.280387Z  INFO store: Starting FlatState value inlining migration read_state_threads=16 batch_size=50000
...
2023-05-15T16:00:24.210821Z DEBUG store: Processed flat state value inlining batch batch_index=1580 inlined_batch_count=50000 inlined_total_count=35943298 batch_duration=67.303985ms
2023-05-15T16:00:25.388086Z DEBUG store: Processed flat state value inlining batch batch_index=1581 inlined_batch_count=50000 inlined_total_count=35993298 batch_duration=89.054046ms
...
2023-05-15T17:02:14.707594Z  INFO store: Finished FlatState value inlining migration inlined_total_count=128780116 migration_elapsed=4388.085856057s
```
near-bulldozer bot pushed a commit that referenced this issue May 25, 2023
…9093)

Part of #8243.

This PR enables the migration added in #9037 to be executed in the background on the running node.
It supports graceful stop when the node is shut down. The implementation is heavily inspired by state sync background dumping to S3.

This PR also introduces a new column `DBCol::Misc`. For now it only stores the status of the migration, but it can hold any small pieces of data, similar to `DBCol::BlockMisc`.

`FlatStorageManager` is exposed as part of `RuntimeAdapter` in this PR. This is the first step in removing all other flat-storage-related methods from `RuntimeAdapter`, as the manager can be used directly instead.

Tested by manually running a node and checking metrics and log messages. After that, flat storage was checked with the `flat-storage verify` command.
pugachAG (Contributor) commented Jun 6, 2023

Status update:

  • Inlining is implemented as part of flat storage.
  • Migration for existing data in FlatState is implemented.
  • Inlined values are not used when processing receipts; those are converted to ValueRefs instead (code; a sketch of this conversion follows below). Also, inlined values are not used for in-memory deltas. The plan is to have a different inlining threshold for cached deltas, but that would result in a different chunk cache state depending on whether a block is final or not. That in turn results in different gas costs, which is an issue. We decided that solving this is not the highest priority right now, since the actual bottleneck is write performance.
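
For context, the conversion mentioned in the last point is roughly the following (a sketch reusing the earlier definitions, not the actual nearcore code):

```rust
/// Convert a flat storage value back to a ValueRef for the receipt-processing
/// path: a Ref is passed through, an inlined value is re-hashed and measured.
fn to_value_ref(value: FlatStateValue) -> ValueRef {
    match value {
        FlatStateValue::Ref(r) => r,
        FlatStateValue::Inlined(bytes) => ValueRef {
            hash: hash_bytes(&bytes),
            length: bytes.len() as u32,
        },
    }
}
```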

Longarithm (Member) commented:

Can we close this issue?

pugachAG (Contributor) commented:

@Longarithm I suggest we keep it open until we implement reading of inlined values from the runtime.

pugachAG closed this as completed Feb 2, 2024