Limit memory consumption of flat storage deltas #8436
After experimenting, I realized that the current logic is too naive and we need to get rid of it. Proposal: change the delta format from `(block_hash, shard_id) -> HashMap<Key, Value>` to `(block_hash, shard_id, key) -> Value`.
We cache up to ~10 deltas anyway, so 99.9% of the time there are no additional disk reads.
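A minimal sketch of the proposed re-keying, using simplified stand-in types (a raw 32-byte hash, little-endian `shard_id`); nearcore's actual column layout and serialization may differ:

```rust
// Sketch of the proposed per-key delta layout; types and encoding are illustrative.
type CryptoHash = [u8; 32];
type ShardId = u64;

/// Old layout: one DB entry per (block, shard), holding the whole delta map.
/// db key = block_hash ++ shard_id, db value = serialized HashMap<Key, Value>.
fn old_db_key(block_hash: &CryptoHash, shard_id: ShardId) -> Vec<u8> {
    let mut key = Vec::with_capacity(32 + 8);
    key.extend_from_slice(block_hash);
    key.extend_from_slice(&shard_id.to_le_bytes());
    key
}

/// Proposed layout: one DB entry per changed trie key.
/// db key = block_hash ++ shard_id ++ trie_key, db value = serialized value (or tombstone).
fn new_db_key(block_hash: &CryptoHash, shard_id: ShardId, trie_key: &[u8]) -> Vec<u8> {
    let mut key = old_db_key(block_hash, shard_id);
    key.extend_from_slice(trie_key);
    key
}

fn main() {
    let block_hash = [0u8; 32];
    // With the new layout a single key can be looked up (or range-scanned by the
    // block_hash ++ shard_id prefix) without deserializing the whole delta.
    let k = new_db_key(&block_hash, 3, b"account.near/storage_key");
    assert!(k.starts_with(&old_db_key(&block_hash, 3)));
    println!("per-key db key length: {}", k.len());
}
```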
@Longarithm My main concern here is that this may lead to drastic performance degradation when there are many unfinalized blocks in the system, which may push the system into a bad feedback loop when it is already in a bad state: the more unfinalized blocks there are, the longer block processing takes, and that makes it harder for blocks to be finalized. The reformatting makes sense, but I still want to combine it with the protocol change that limits the number of unfinalized blocks to 12. You can also implement this first and do some performance testing later to see how much performance degrades when there are more than 20 unfinalized blocks.
I usually posted updates in Zulip, but I'll try posting references here, since we align on Core team focus from the GitHub kanban. The new experiment results are very confusing. For a finality delay of 60 blocks, even if we cache all deltas, get_ref performance jumps from 50us to 500us: https://near.zulipchat.com/#narrow/stream/345766-pagoda.2Fstorage.2Fflat-storage/topic/limiting.20RAM.20usage/near/327970531 I plan to look into get_ref more closely and see where the latency comes from.
Last week we agreed on our approach for the MVP ("Use key hash") and updated the summary (see issue description). Based on the memory estimation, we can store deltas for around 50 full blocks in memory; after that we will surface a message that flat storage is likely to run into trouble.
#8662 merged
We ran an experiment with one node at a 50-block finality lag and one node at zero finality lag; with cached deltas the results are the same.
The metrics clearly show that the first node regularly "falls behind" by 50 blocks while the second one does not, yet get_ref time is similar, so we can close the issue.
Summary
With flat storage, we start relying on the fact that the size of recent state updates stored in memory is well bounded. Otherwise, if blocks are not finalized for a long time, all flat storage deltas will be kept in memory and the node will crash. Moreover, on restart it will try to load the deltas back into memory and crash again.
The idea is to store a limited number of deltas in memory. The limit on a single delta's size is ~50 MB for 4 shards, based on storage costs. The main contributor is key length, because the values we store are `ValueRef`s, which have bounded size. Say we store only 10 deltas; then memory usage is about 500 MB. Other deltas will be read from disk.
The weakness of this solution is that if there is no finality, reading deltas from disk may slow block production down significantly. On the other hand, if there is no finality for >100 blocks, we are likely to have other issues with the chain and some manual intervention will be required. So the plan is to check whether reading 100 deltas from disk introduces a slowdown, and if it does not, we proceed with this approach.
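A minimal sketch of the idea, with hypothetical simplified types and a placeholder `read_delta_entry_from_disk`; nearcore's real flat storage and delta handling are more involved:

```rust
use std::collections::{HashMap, VecDeque};

// Hypothetical simplified types for illustration only.
type BlockHash = [u8; 32];
type TrieKey = Vec<u8>;
type ValueRef = [u8; 32];
type Delta = HashMap<TrieKey, Option<ValueRef>>; // None marks a deletion

const MAX_CACHED_DELTAS: usize = 10; // ~500 MB at ~50 MB per delta (4 shards)

struct DeltaCache {
    // Most recent deltas kept in memory, oldest evicted first.
    cached: VecDeque<(BlockHash, Delta)>,
}

impl DeltaCache {
    fn add(&mut self, block: BlockHash, delta: Delta) {
        if self.cached.len() == MAX_CACHED_DELTAS {
            self.cached.pop_front();
        }
        self.cached.push_back((block, delta));
    }

    /// Look up a key in the delta for `block`: memory first, then the slow disk path.
    fn get(&self, block: &BlockHash, key: &TrieKey) -> Option<Option<ValueRef>> {
        if let Some((_, delta)) = self.cached.iter().find(|(b, _)| b == block) {
            return delta.get(key).cloned();
        }
        read_delta_entry_from_disk(block, key)
    }
}

// Placeholder for the disk read; in a real node this would go through the store.
fn read_delta_entry_from_disk(_block: &BlockHash, _key: &TrieKey) -> Option<Option<ValueRef>> {
    None
}

fn main() {
    let mut cache = DeltaCache { cached: VecDeque::new() };
    cache.add([1; 32], Delta::new());
    assert!(cache.get(&[1; 32], &b"some_key".to_vec()).is_none());
}
```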
Otherwise we can implement more heuristics, like "if a block is too far from the last final block, set its gas limit to zero", which almost guarantees empty state updates: #8437
Assumptions
Upper bound on the number of updates in a single chunk is 20k: `block_gas_limit / wasm_storage_write_base = 1300 Tgas / 64 Ggas = 20k` (calculations from the NEP).
Max key size is `2KB`.
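For reference, a tiny sketch reproducing that bound (constants taken from the text above, not read from the protocol config):

```rust
fn main() {
    // Constants as quoted in this issue, expressed in gas units.
    let block_gas_limit: u64 = 1_300_000_000_000_000; // 1300 Tgas
    let wasm_storage_write_base: u64 = 64_000_000_000; // 64 Ggas
    let max_updates_per_chunk = block_gas_limit / wasm_storage_write_base;
    // Prints ~20 (thousand): the per-chunk bound used in the size estimates below.
    println!("upper bound on updates per chunk: ~{}k", max_updates_per_chunk / 1000);
}
```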
Approaches
Naive
We can cache `HashMap<TrieKey, ValueRef>`, which results in a maximum delta size of `50MB`. It means that storing deltas for 20 such blocks will take `1GB` of RAM.
Use key hash
Let's use the TrieKey hash instead of the whole key, so we store 32 bytes per key regardless of its size. This way the maximum size of a delta for a single block is `20k * (32B + 32B) = 1.28MB`. TODO (@pugachAG): check hashmap overhead.
Storing 200 deltas would take `1.28MB * 200 = 256MB`.
This is the approach we proceed with in the MVP.
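A minimal sketch of the key-hash variant, with stand-in types (a generic 32-byte `KeyHash` in place of nearcore's actual hashing of `TrieKey`):

```rust
use std::collections::HashMap;

// Stand-in types for illustration; nearcore uses CryptoHash / ValueRef.
type KeyHash = [u8; 32];
type ValueRef = [u8; 32];

/// Per-block delta keyed by the hash of the trie key instead of the key itself,
/// so each entry costs 32B + 32B regardless of the original key length (up to 2KB).
#[derive(Default)]
struct KeyHashedDelta {
    entries: HashMap<KeyHash, Option<ValueRef>>, // None marks a deletion
}

fn main() {
    // Rough size math from the issue: 20k updates * (32B + 32B) = 1.28MB per delta,
    // so 200 cached deltas is about 256MB (plus HashMap overhead, still to be checked).
    let per_entry_bytes = 32 + 32;
    let per_delta = 20_000 * per_entry_bytes;
    println!(
        "per delta: ~{} KB, 200 deltas: ~{} MB",
        per_delta / 1_000,
        200 * per_delta / 1_000_000
    );
    let _delta = KeyHashedDelta::default();
}
```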
Use Bloom filter
With this approach we keep a Bloom filter over the keys of each delta (see zulip thread). This way we can check whether the value was updated in a given chunk and read the updated value from rocksdb.
Assuming `16 bits = 2B` per value for the Bloom filter, this would require `20k * 2B = 40KB` per delta, so we can keep deltas for multiple thousands of blocks.
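A minimal illustrative sketch of a per-delta Bloom filter at ~16 bits per key (hand-rolled here for self-containment; a real implementation would use a vetted filter library and tuned parameters):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy Bloom filter: ~16 bits per expected key, 4 hash functions.
struct DeltaBloom {
    bits: Vec<u64>,
    num_bits: u64,
}

impl DeltaBloom {
    fn new(expected_keys: usize) -> Self {
        let num_bits = (expected_keys as u64) * 16; // 16 bits = 2B per key
        DeltaBloom { bits: vec![0; (num_bits as usize + 63) / 64], num_bits }
    }
    fn bit_index(&self, key: &[u8], seed: u64) -> u64 {
        let mut h = DefaultHasher::new();
        seed.hash(&mut h);
        key.hash(&mut h);
        h.finish() % self.num_bits
    }
    fn insert(&mut self, key: &[u8]) {
        for seed in 0..4u64 {
            let i = self.bit_index(key, seed);
            self.bits[(i / 64) as usize] |= 1 << (i % 64);
        }
    }
    /// `false` means "definitely not changed in this delta": skip the rocksdb read.
    fn maybe_contains(&self, key: &[u8]) -> bool {
        (0..4u64).all(|seed| {
            let i = self.bit_index(key, seed);
            (self.bits[(i / 64) as usize] & (1 << (i % 64))) != 0
        })
    }
}

fn main() {
    // One filter per delta: 20k keys * 2B = ~40KB of filter state per block.
    let mut bloom = DeltaBloom::new(20_000);
    bloom.insert(b"changed_key");
    assert!(bloom.maybe_contains(b"changed_key"));
    // An untouched key is almost certainly reported absent, so no disk read is needed.
    let _ = bloom.maybe_contains(b"untouched_key");
}
```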
Rely on rocksdb Bloom filters
With this approach we do not cache anything in memory and go to rocksdb for every delta lookup. It relies on rocksdb-provided Bloom filters to quickly determine that a key is absent from the deltas for a given block.
This approach requires careful rocksdb tuning. With our current rocksdb configuration and synthetic data (500 blocks, 10k entries in each block, 300B key size), checking whether a key is present in the deltas for a fixed block takes 0.8-3 μs on average and depends heavily on block cache size and compaction. So relying on that is too risky.
In the long term this seems like the most reasonable approach, but also the most time-consuming as it requires deep understanding of rocksdb internals as well as a lot of experimentation with both real-world and edge case data.
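For illustration, a rough sketch of the kind of rocksdb tuning this would involve, using the Rust `rocksdb` crate; option names and signatures vary across crate versions and nearcore configures rocksdb through its own store layer, so treat this as an assumption-laden example rather than the actual setup:

```rust
// Illustrative only: exact crate APIs may differ between rocksdb crate versions.
use rocksdb::{BlockBasedOptions, Options, DB};

fn open_with_bloom_filters(path: &str) -> Result<DB, rocksdb::Error> {
    let mut block_opts = BlockBasedOptions::default();
    // Bloom filter with ~10 bits per key: lets rocksdb answer
    // "key definitely absent in this SST file" without touching data blocks.
    block_opts.set_bloom_filter(10.0, false);
    // Keep index/filter blocks in the block cache so negative lookups stay cheap.
    block_opts.set_cache_index_and_filter_blocks(true);

    let mut opts = Options::default();
    opts.create_if_missing(true);
    opts.set_block_based_table_factory(&block_opts);
    DB::open(&opts, path)
}

fn main() {
    // The measurements quoted in this issue (0.8-3 us per negative lookup on synthetic
    // data) were highly sensitive to block cache size and compaction state, which is
    // why this approach is considered too risky for now.
    let _db = open_with_bloom_filters("/tmp/delta-bloom-test").expect("open rocksdb");
}
```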