Flat State Gas Costs (Read FS only) #8006
Theory: 200 Ggas base cost

If we assume we keep the same caching behavior as today, we can therefore safely replace the old cost for reading the final value (56us + 16us) with a new cost of 100us. For those within Pagoda, additional analysis of this cost for Sweatcoin traffic specifically can be found in this private document. (If someone outside Pagoda requires access, ping me on Zulip.)
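The headline number can be checked with a quick back-of-the-envelope conversion. This sketch assumes the convention the numbers in this thread appear to follow, namely that 1us of execution time maps to 1 Ggas (so 1ms ≈ 1 Tgas); the helper function is illustrative, not a nearcore API.

```python
# Back-of-the-envelope sketch: convert assumed DB latencies into gas.
# Assumption: 1 us of wall-clock time is charged as 1 Ggas (1 ms ~ 1 Tgas).
GGAS_PER_US = 1

def latency_to_ggas(micros):
    """Illustrative helper, not a nearcore function."""
    return micros * GGAS_PER_US

# One FlatState lookup plus one State lookup at ~100 us each:
flat_state_read_us = 100
state_read_us = 100
base_cost_ggas = latency_to_ggas(flat_state_read_us + state_read_us)
assert base_cost_ggas == 200  # the "200 Ggas base cost" theory, i.e. 0.2 Tgas
```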
Practical observations (in progress):

* DB column read latency
* `get_ref` latency

(Measured in November 2022.)
Overhead of FlatState in normal operation

Regarding additional costs, aside from the `DBCol::FlatState` lookup itself, I'll describe my findings, starting with the most important one.

FlatStateDeltas

Testing the Sweatcoin load with cold caches, I didn't observe ANY reads from FlatStateDeltas. That's good: we keep deltas in memory, so we should only need to read them upon start. Also good, I did see deltas being written and deleted.

```
# count of FlatStateDeltas accesses over ~15 mins
8408 DELETE FlatStateDeltas
4204 SET FlatStateDeltas
```

This looks like we set 1 delta per chunk, and delete each one twice. That seems odd; I'm not sure why we delete each delta twice. I'll look into that, but for performance it doesn't really matter. The delta sizes were also not too bad: the average size was 7049.9 B.
FlatStateMisc

This column is accessed to read and update the flat head. Over a period of ~15 minutes, I saw four keys each being set about 1k times. That's once per chunk, as expected.

```
# count of FlatStateMisc accesses and the keys over ~15 min
# grep FlatStateMisc feb27_mainnet_fsandtrie_withfscache.io_trace | sort | uniq --count
1051 SET FlatStateMisc "SEVBRAAAAAAAAAAA" size=32
1051 SET FlatStateMisc "SEVBRAEAAAAAAAAA" size=32
1051 SET FlatStateMisc "SEVBRAIAAAAAAAAA" size=32
1051 SET FlatStateMisc "SEVBRAMAAAAAAAAA" size=32
```

In terms of performance, this should be negligible.

Red herrings
Based on @jakmeier's estimations, we need to cache `ValueRef`s for the flat storage head (see #8006). RocksDB's internal implementation and block cache don't help, and we need to make flat storage performance at least comparable to trie performance in the MVP, in order not to make the undercharging issue worse. This cache lives inside `FlatStorageState`, can be accessed in `get_ref` before attempting to read the value ref from the flat storage head, and must be updated when we apply a delta. I think it makes sense to make the cache capacity configurable, and this config fits into `StoreConfig`. I don't like that it is propagated to `FlatStorageState` from `ShardTries`; it mixes trie storage and flat storage even more. Perhaps it needs to be fully moved inside `FlatStateFactory`, but I am not sure.

## Testing

* extend `flat_storage_state_sanity` to check both cached and non-cached versions;
* `flat_storage_state_cache_eviction` to check that the eviction strategy is applied correctly.
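To illustrate the intended behavior, here is a minimal Python model of such a cache: lookup before hitting the flat storage head, update on delta application, and LRU eviction at a configurable capacity. The real cache lives in Rust inside `FlatStorageState`; the class and method names here are hypothetical.

```python
from collections import OrderedDict

class ValueRefCache:
    """Hypothetical model of the proposed ValueRef cache, not nearcore code."""

    def __init__(self, capacity):
        self.capacity = capacity          # would come from StoreConfig
        self.entries = OrderedDict()      # key -> ValueRef, in LRU order

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key) # mark as most recently used
            return self.entries[key]
        return None                       # caller falls through to flat storage head

    def put(self, key, value_ref):
        # Called both on cache misses and when applying a delta updates a key.
        self.entries[key] = value_ref
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used

cache = ValueRefCache(capacity=2)
cache.put(b"a", "ref-a")
cache.put(b"b", "ref-b")
cache.get(b"a")           # touch "a", so "b" becomes the eviction candidate
cache.put(b"c", "ref-c")  # evicts "b"
assert cache.get(b"b") is None
assert cache.get(b"a") == "ref-a"
```

This is the eviction behavior a test like `flat_storage_state_cache_eviction` would pin down.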
Which branch do you use for testing? Perhaps in the past I've recommended master...fs-0209-base-test as the most configurable. Though we don't need it anymore, because we are not going to store items in deltas separately. Anyway, there we use
I was using master from last Friday with just the FS cache added. Basically what we have in master now plus #8540, unless I missed some FS PRs since then. Do you have important changes on branches outside of master?
No, master with cache is great. Then I don't see why there are exactly 2x deletions; they are triggered only once in
Finishing this issue and providing a final proposal for how expensive the new storage requests should be is my current focus.
Expect a (final) cost proposal from my side by tomorrow, or at the latest, the day after.
I'm falling behind on this a bit, sorry. The good news is, I already have the estimator integration, I just need to run it on the proper machine and summarize the results. If tomorrow I find slightly more time in between meetings than today, then I will have the proposed costs ready tomorrow.
Using the estimator, I evaluated the overhead of in-memory delta processing. Here are the measurements:
Looks like a very clean linear increase in read time for each delta added, with a relatively small effect from the number of keys per delta. And as expected, the write time does not change. It's 0.8us per delta if we have 50 keys per delta, or 0.81us per delta if we have 1000 keys per delta. Our assumption for RAM usage was that 50 deltas is the maximum we support. Taking that for gas costs as well, we expect about a 40us increase per read for delta resolving in the worst case.
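The worst-case arithmetic from these estimator measurements, as a sketch:

```python
# Worst-case delta-resolving overhead per read, from the estimator numbers
# quoted above: ~0.8 us per in-memory delta, at most 50 deltas kept in RAM.
us_per_delta = 0.8
max_deltas = 50

worst_case_overhead_us = us_per_delta * max_deltas
assert abs(worst_case_overhead_us - 40) < 1e-9  # ~40 us extra per read
```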
## Cost options

We already decided that we want to keep gas parameters the same all around. But we need to add higher compute costs to compensate. I want to suggest two ways of looking at it. Either we just increase costs to cover the difference between a flat state and state lookup, or we set the costs to what we think they should be in absolute terms, just looking at DB latencies of an RPC node.

### Option 1: Cover only the additional cost of FlatState

Here we just add the 50us FlatState latency to every read base cost, to compensate for the TTN that is no longer charged for them.
### Option 2: Set the costs to true costs

Here I take the measured 1100us access time from RPC nodes. For trie nodes, I would assume a 90% hit rate, which is worse than the measurements but certainly a possibility with nasty access patterns. But for the final value reads (covered by base costs) I would assume no cache hits. The difference between reads and writes becomes negligible here, since writes are not blocking on IO.
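A sketch of this option 2 accounting, under the stated assumptions (~1100us uncached access, 90% trie-node cache hit rate, no cache hits on final value reads); treating a cache hit as free is an extra simplification of mine:

```python
# Option 2 sketch: expected per-access latency under worst-ish assumptions.
uncached_us = 1100    # measured State access time on RPC nodes
trie_hit_rate = 0.9   # assumed trie-node cache hit rate
# Simplification: a cache hit costs ~0 us.

ttn_cost_us = (1 - trie_hit_rate) * uncached_us  # per touched trie node
base_read_cost_us = uncached_us                  # final value read, no cache hits

assert abs(ttn_cost_us - 110) < 1e-6   # ~110 us per trie node
assert base_read_cost_us == 1100
```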
But these factors are on the excessive side, especially since we know that the network runs just fine with current costs.

### Option 3: Balance the two approaches

I suggest we do the following:
I must admit, the 200us number is a bit arbitrary. We discussed 10k IOPS as a useful baseline before, which would imply a 100us base cost. But actual benchmarks show that 100us is still quite optimistic for RocksDB with large state on a network-attached SSD. One could also argue for 400us based on those results. But 400us reaches the limits of what we can do before we seriously harm current mainnet users. Throughput would be visibly limited compared to today, and gas prices would potentially start to increase. Just as an example, a Sweatcoin batch has about 30% of the WASM cost in storage base costs. With 400us, we would have a factor of 7 on base costs, which would mean 70% + 30% * 7 = 280% of the current compute cost. That in turn means 357 Tgas of their batches would fill an entire chunk and start causing congestion.
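The congestion arithmetic above, spelled out. The 1000 Tgas per-chunk gas limit is my assumption, chosen because it is consistent with the 357 Tgas figure:

```python
# Congestion sketch for the 400 us variant on a Sweatcoin-like batch.
storage_share = 0.3     # fraction of WASM cost spent on storage base costs
base_cost_factor = 7    # factor-7 increase of base costs at 400 us
chunk_limit_tgas = 1000 # assumption: per-chunk gas limit

compute_multiplier = (1 - storage_share) + storage_share * base_cost_factor
assert abs(compute_multiplier - 2.8) < 1e-9  # 280% of current compute cost

# Gas worth of such batches that would then fill a whole chunk:
tgas_to_fill_chunk = chunk_limit_tgas / compute_multiplier
assert round(tgas_to_fill_chunk) == 357
```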
Thank you for the detailed evaluation, Jakob!
Naive question: are we trying to find the balance between "not affecting contracts" and "minimizing our cost damage as much as possible"? It is not super clear to me what baseline assumption we have with 200us (e.g., it's still not a number high enough to cover the actual usage).
Yeah, that's the balance we are trying to find. But the problem is we don't know the true cost, or rather, the true cost is not well defined as a function of just the number of requests. One way to cover the "true full cost" would be to make worst-case assumptions and go with 1100us. But that's too much damage to users. As I wrote in my previous comment, 200us does not have any baseline assumptions; it's a rather arbitrary point in the range of possible values we could consider.
Today at the flat storage engineering meeting, we decided to move forward with option 3 for compute costs. I will make sure to update these on the NEP and set the compute costs in nearcore, in due time. But we have to finish the compute costs work first.
Should we close the issue?
Yes, makes sense, I will close it now.
This is a protocol feature, updating the storage cost as agreed in near#8006 to allow flat storage to be deployed without undercharging risks.
The code literally removes `protocol_feature_flat_state` and moves the feature to the stable protocol. We also disable `test_state_sync`, as this is part of a refactor we can do in Q2.

## Feature to stabilize

Here we stabilize Flat Storage for reads, which means that all state reads in the client during block processing will query flat storage instead of the Trie. Flat Storage is another index for blockchain state, reducing the number of DB accesses for a state read from `2 * key.len()` in the worst case to 2.

This will trigger background creation of flat storage, using 8 threads and finishing in 15h for an RPC node and 2d for an archival node. After that, all non-contract reads will go through flat storage, for which special "chunk views" will be created. When the protocol upgrade happens, contract reads will go through flat storage as well. Also, compute costs will change as Option 3 suggests [here](#8006 (comment)). That change is to be merged separately, but we need to ensure that both the cost change and flat storage go into the next release together.

## Context

Find more details in:
- Overview: https://near.github.io/nearcore/architecture/storage/flat_storage.html
- Approved NEP: https://github.com/near/NEPs/blob/master/neps/nep-0339.md
- Tracking issue: #7327

## Testing and QA

* Flat storage structs are covered by unit tests;
* Integration tests check that chain behaviour is preserved and costs are changed as expected;
* Flat storage spent ~2 months in betanet with an assertion that flat and trie `ValueRef`s are the same;
* We were running testnet/mainnet nodes for ~2 months with the same assertion. We checked that performance is not degraded, see e.g. https://nearinc.grafana.net/d/Vg9SREA4k/flat-storage-test?orgId=1&var-chain_id=mainnet&var-node_id=logunov-mainnet-fs-1&from=1677804289279&to=1678088806154, checking that even with a finality lag of 50 blocks performance is not impacted.
A small exception is that we updated the data layout several times during development, but we checked that results are unchanged.

## Checklist

- [x] Include compute costs after they are merged - #8924
- [x] https://nayduck.near.org/#/run/2916
- [x] Update CHANGELOG.md to include this protocol feature in the `Unreleased` section.
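The access-count claim in the description can be sketched out: a worst-case trie lookup touches one node per key nibble, while a flat storage read needs just one `FlatState` lookup plus one `State` read. The helper names and the example key below are illustrative only.

```python
# Sketch of worst-case DB access counts: trie walk vs flat storage read.
def trie_worst_case_accesses(key: bytes) -> int:
    # Two nibbles per key byte, up to one trie node fetched per nibble.
    return 2 * len(key)

def flat_storage_accesses(key: bytes) -> int:
    # One FlatState lookup (key -> ValueRef) + one State read (ValueRef -> value).
    return 2

key = b"alice.near/balance"  # hypothetical 18-byte state key
assert trie_worst_case_accesses(key) == 36
assert flat_storage_accesses(key) == 2
```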
Storage operations will need new gas parameters post flat state.
We have to evaluate its performance from several perspectives to set them.
This issue description succinctly summarizes the results in one place and is updated with the newest results as data is collected.
Detailed data and discussion go into the comments.
**What is the theoretical performance expected from FS?**

We expect one `FlatState` and one `State` lookup for every read. Assuming 100us per DB access, this results in 0.2 Tgas total cost. However, RPC nodes and especially archival nodes are significantly slower for the `State` column. According to that, we should rather go with 2 * 1100us. But if we keep relying on good trie node caches, as we do today, then 0.2 Tgas total cost on average is still the expectation.

**What is the observed latency for `DBCol::FlatState` reads and how does it compare to `DBCol::State`?**

`DBCol::FlatState` is faster than state, sitting around 50us +/- 20us.

On RPC nodes, we see about 20 times faster latencies on average, and 4 times faster on the 99th percentile (= uncached reads). In absolute terms, the average is below 100us but the 99th percentile is above 1000us. The block cache really matters here.

For archival nodes, FS is basically the same. Compared to `State`, it is between 50 and 100 times faster on average and on the 99th percentile, on both testnet and mainnet. Absolute latencies are around 50us, +/- 20us.
**What are the additional DB costs of FS, aside from the `DBCol::FlatState` lookup?**

GOOD: In normal operation mode, even with high load and cold caches, all "constant" overhead seems to be small enough.
OPEN QUESTION: How bad is the overhead when flat head lags behind? In terms of DB load, it doesn't change because we keep everything in memory. But we should measure the CPU overhead in the estimator framework.
When running in an isolated setup integrated in the runtime-params-estimator, what is the read latency?
In summary, what are the suggested new storage cost parameters?