
Lower storage compute costs #12044

Merged: 4 commits from update-costs into near:master on Sep 12, 2024
Conversation

@Ekleog (Contributor) commented Sep 5, 2024

See #11925 for all the gory details.

The major points are:

  • We currently overestimate read_base and has_key_base by 150 Ggas. This is to handle the fact that we don't actually have ReadTTN yet. Once we have ReadTTN, we can reduce these numbers to match estimator outputs again. But this will have to wait for a solution for RPC and archival nodes, which currently live on flat storage. The solution might be to enforce memtrie for them too, but that would increase operational expenses.
  • Reduction of (Write)TTN is made possible by the fact that the critical path only has lots of (memtrie) reads, and a single write at the end of the block. The latency cost is accounted for in write_base, and RocksDB batches writes, so WriteTTN can be reasonably close to estimator results nowadays, even though we still have to take some variance in disk write throughput into account.
  • Once this lands, we should check whether blocks become limited by chunk witness size, or whether gas is still the limiting factor in most cases.
  • ReadCachedTrieNode was not reduced due to concerns about how it is actually implemented with memtrie.
  • We had to set some gas costs to 1. This is because our current way of accounting for compute costs is to take the gas burnt for a parameter, divide it by that parameter's gas cost, and multiply by its compute cost. So our current code does not support a gas cost of 0 with a non-zero compute cost, and changing that would require refactoring (see the sketch after this list).
  • All raw estimator results and graphs from which the numbers were derived are available on Gas costs post-memtrie #11925.
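To make that next-to-last point concrete, here is a minimal sketch (mine, not the actual nearcore implementation) of the gas-to-compute scaling it describes:

```rust
// Minimal sketch, not the actual nearcore code: compute usage is derived by
// scaling the gas burnt for a parameter by the ratio of its compute cost to
// its gas cost. A gas cost of 0 would divide by zero here, which is why the
// PR sets some gas costs to 1 instead of 0.
fn compute_usage(gas_burnt: u64, gas_cost: u64, compute_cost: u64) -> u64 {
    // gas_burnt is a multiple of gas_cost (number of operations * gas_cost),
    // so dividing recovers the operation count before scaling by compute_cost.
    gas_burnt / gas_cost * compute_cost
}

fn main() {
    // Example: 3 operations at 10 gas each, with a compute cost of 40 per op.
    assert_eq!(compute_usage(30, 10, 40), 120);
}
```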

cc @Ekleog-NEAR

@Ekleog Ekleog requested a review from a team as a code owner September 5, 2024 07:20
@Ekleog Ekleog requested a review from akhi3030 September 5, 2024 07:20
@Ekleog Ekleog marked this pull request as draft September 5, 2024 07:21
```rust
Some(read) => {
    let read_len = read.len() as u64;
    // TODO BEFORE UNDRAFTING: REPLACE WITH THE RIGHT CONSTANT USAGE
    if read_len < 4096 {
```
Collaborator:

@pugachAG Do you know where the 4k constant, which I've been told is the limit for loading values into memtrie, lives? I've tried digging around core/store/src/trie/mem but haven't been able to find the place in the trie loading code that limits values to 4k, so I couldn't reuse the constant from there.

Contributor:

As far as I know, inlined values are taken from flat storage when loading the memtrie, so the constant is INLINE_DISK_VALUE_THRESHOLD.
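For reference, a minimal sketch of how the check could look once the TODO above is resolved. This is my own illustration: the threshold constant is redeclared locally here, while the real INLINE_DISK_VALUE_THRESHOLD lives in near-store and its exact value and path are not confirmed here.

```rust
// Hypothetical stand-in for near-store's INLINE_DISK_VALUE_THRESHOLD; the
// real constant should be imported rather than redeclared, and its value is
// not confirmed here.
const INLINE_DISK_VALUE_THRESHOLD: u64 = 4096;

// Values above the inlining threshold are not inlined in flat storage /
// memtrie, so they pay the on-disk read overhead.
fn is_large_read(value_len: u64) -> bool {
    value_len > INLINE_DISK_VALUE_THRESHOLD
}

fn main() {
    assert!(!is_large_read(1_024));
    assert!(is_large_read(8_192));
}
```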

@Ekleog Ekleog force-pushed the update-costs branch 2 times, most recently from 3bb82d8 to 018ef57 Compare September 9, 2024 03:39
@Ekleog-NEAR (Collaborator) commented:

I'm setting the small read base to 159 Ggas in this PR. That is 9 Ggas of small read base, plus 6 Ggas of TTN multiplied by an overestimated trie depth of 25 (see the arithmetic sketch below).

I think we should be fine even if the trie depth increases beyond 25, because the 6 Ggas of TTN is probably overestimated, given that it is actually a WriteTTN.

We should be able to reduce these costs much further on average (because not all trie branches are that deep) once we manage to implement ReadTTN.
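For reference, a restatement of the arithmetic behind the 159 Ggas figure (my own sketch, with the numbers quoted above, in Ggas):

```rust
// All figures in Ggas, taken from the comment above.
const SMALL_READ_BASE: u64 = 9; // estimator result for the read base itself
const WRITE_TTN: u64 = 6; // WriteTTN used as a stand-in while ReadTTN is missing
const OVERESTIMATED_TRIE_DEPTH: u64 = 25;

fn main() {
    let read_base = SMALL_READ_BASE + WRITE_TTN * OVERESTIMATED_TRIE_DEPTH;
    assert_eq!(read_base, 159); // 9 + 6 * 25 = 159 Ggas
}
```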

@Ekleog-NEAR Ekleog-NEAR marked this pull request as ready for review September 9, 2024 03:43
@Ekleog Ekleog force-pushed the update-costs branch 4 times, most recently from f1a1dc4 to c40e889 Compare September 10, 2024 11:12
@Ekleog-NEAR (Collaborator) commented:

This should be ready for review. All tests passed locally, hopefully they will pass on CI too :)

codecov bot commented Sep 10, 2024

Codecov Report

Attention: Patch coverage is 85.71429% with 7 lines in your changes missing coverage. Please review.

Project coverage is 71.56%. Comparing base (3718df6) to head (adc3479).
Report is 16 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| runtime/near-vm-runner/src/logic/logic.rs | 61.53% | 0 Missing and 5 partials ⚠️ |
| core/parameters/src/view.rs | 66.66% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #12044      +/-   ##
==========================================
+ Coverage   71.40%   71.56%   +0.15%     
==========================================
  Files         814      815       +1     
  Lines      163936   164304     +368     
  Branches   163936   164304     +368     
==========================================
+ Hits       117055   117578     +523     
+ Misses      41754    41582     -172     
- Partials     5127     5144      +17     
| Flag | Coverage Δ |
|---|---|
| backward-compatibility | 0.17% <0.00%> (-0.01%) ⬇️ |
| db-migration | 0.17% <0.00%> (-0.01%) ⬇️ |
| genesis-check | 1.26% <0.00%> (-0.01%) ⬇️ |
| integration-tests | 38.69% <32.65%> (+0.12%) ⬆️ |
| linux | 71.20% <85.71%> (+<0.01%) ⬆️ |
| linux-nightly | 71.11% <81.08%> (+0.12%) ⬆️ |
| macos | 53.43% <71.42%> (+0.59%) ⬆️ |
| pytests | 1.53% <0.00%> (-0.01%) ⬇️ |
| sanity-checks | 1.33% <0.00%> (-0.01%) ⬇️ |
| unittests | 65.32% <83.67%> (+0.14%) ⬆️ |
| upgradability | 0.22% <0.00%> (-0.01%) ⬇️ |

Flags with carried forward coverage won't be shown.


```yaml
main_storage_proof_size_soft_limit: {old: 3_000_000, new: 4_000_000}
wasm_storage_has_key_base: { old: { gas: 54_039_896_625, compute: 200_000_000_000 }, new: { gas: 54_039_896_625, compute: 8_000_000_000 } }
```
@pugachAG (Contributor) commented Sep 10, 2024:

This should be the same as wasm_storage_small_read_base.

Collaborator:

Good catch, thank you! This is now fixed :)
(For people who were not in our DMs: I forgot to include the overestimated ReadTTN cost in has_key_base; this is now fixed by setting it to 158 Ggas.)

@pugachAG (Contributor) left a comment:

Could you please elaborate on the motivation for introducing 3 additional costs for reading small values?
As far as I can see, we only need to add one compute-only cost to account for the overhead of reading a large value (> 4 KiB) from disk.

```yaml
main_storage_proof_size_soft_limit: {old: 3_000_000, new: 4_000_000}
wasm_storage_has_key_base: { old: { gas: 54_039_896_625, compute: 200_000_000_000 }, new: { gas: 54_039_896_625, compute: 8_000_000_000 } }
wasm_storage_has_key_byte: { old: 30_790_845, new: { gas: 30_790_845, compute: 9_000_000 } }
```
Contributor:

As far as I understand, this should be the same as wasm_storage_small_read_key_byte.

@Ekleog-NEAR (Collaborator) commented Sep 11, 2024:

Makes sense. I just checked the raw data from the estimator, and it seems reasonable to set both at 10 Mgas; there is a small performance difference, but nothing noticeable. Adjusting the numbers :)

@bowenwang1996 (Collaborator) commented:

@Ekleog-NEAR do you have an example of the reduction in compute cost? For example, what is the new vs. old compute cost in a typical workload such as fungible token transfers?

@Ekleog-NEAR (Collaborator) commented Sep 11, 2024:

> Could you please elaborate on the motivation for introducing 3 additional costs for reading small values?
> As far as I can see, we only need to add one compute-only cost to account for the overhead of reading a large value (> 4 KiB) from disk.

First, I think we do need the large/small distinction, at least for the base and value_byte costs, because both the base latency and the throughput differ between on-disk and in-memory trie reads.

Also, you mentioned a compute-only cost. While we could do it that way, it would make it harder to eventually reduce the gas costs of memtrie reads in the future.

As for small_read_key_byte, I agree with you that we could probably do without it. However, I do think it makes sense to add it for consistency: if we had two small_read costs but not the third one, it would likely surprise future readers of the code. Considering that one additional cost is not a lot of lines of code, I feel like having it reduces tech debt by making the costs more straightforward.

IOW, I think it is semantically better to have two sets of three costs, one for "small reads" and one for "large reads", than to have "read base, read key byte, read value byte, read overcost for large value base, read overcost for large value per value byte" costs. The drawback, as you could see, is that there is a bit of juggling with the can_burn_gas function when charging for it.

Does that make sense, or do you think the other alternative would be more semantically useful for end-users?

@Ekleog-NEAR (Collaborator) commented Sep 11, 2024:

I just pushed an update that removes the small_read costs, and instead adds two large_read_overhead costs as per the videochat with Anton.

Hopefully, with this change this is now ready to land :)
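For readers following along, here is a rough sketch of the charging scheme this converges on: the regular read costs always apply, and two extra overhead costs (base and per value byte) apply only to values above the inlining threshold. This is my own illustration; the parameter names, the threshold constant, and the actual gas_counter.rs wiring in the PR may differ.

```rust
// Placeholder for the real inlining threshold constant (value not confirmed).
const LARGE_VALUE_THRESHOLD: u64 = 4096;

// Illustrative cost parameters; the real ones live in the runtime config.
struct ReadCosts {
    read_base: u64,
    read_value_byte: u64,
    large_read_overhead_base: u64,
    large_read_overhead_byte: u64,
}

// Total charge for reading a value of `value_len` bytes: regular costs plus,
// for large values only, the extra on-disk overhead costs.
fn charge_read(costs: &ReadCosts, value_len: u64) -> u64 {
    let mut total = costs.read_base + costs.read_value_byte * value_len;
    if value_len > LARGE_VALUE_THRESHOLD {
        total += costs.large_read_overhead_base + costs.large_read_overhead_byte * value_len;
    }
    total
}

fn main() {
    let costs = ReadCosts {
        read_base: 100,
        read_value_byte: 1,
        large_read_overhead_base: 50,
        large_read_overhead_byte: 2,
    };
    // A small value pays only the regular costs; a large one also pays the overhead.
    assert_eq!(charge_read(&costs, 10), 110);
    assert_eq!(charge_read(&costs, 5_000), 100 + 5_000 + 50 + 10_000);
}
```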

@Ekleog Ekleog force-pushed the update-costs branch 2 times, most recently from cb67708 to 677166a Compare September 11, 2024 12:24
@Ekleog-NEAR (Collaborator) commented:

The tests all pass, and the PR seems ready for re-review to me :)

Resolved review threads:

  • core/parameters/res/runtime_configs/61.yaml (outdated)
  • core/parameters/res/runtime_configs/72.yaml
  • runtime/near-vm-runner/src/logic/gas_counter.rs (outdated)
  • runtime/near-vm-runner/src/logic/logic.rs (outdated)
@pugachAG (Contributor) commented:

@Ekleog could you also add a short summary in the description of this PR? #11925 has so many comments that it no longer works as a nice reference for that.

@Ekleog-NEAR (Collaborator) commented Sep 12, 2024:

@bowenwang1996 To answer your question, storage costs on real-world use cases are reduced by about 50%. Storage gas costs are themselves around 25-50% of the total gas costs, but they are also the only ones whose compute costs are higher than their gas costs.

Overall, taking everything into account, on a few randomly-chosen receipts (4 HOT receipts and 3 sweat receipts), I'm seeing a pretty stable ~30-40% improvement. This, in turn, means that the chain throughput will increase by roughly 50% after this change reaches mainnet.
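As a rough sanity check of that last step (my own arithmetic, assuming blocks are compute-bound and per-receipt compute drops by roughly 35% on average):

$$\frac{\text{new throughput}}{\text{old throughput}} \approx \frac{1}{1 - 0.35} \approx 1.54$$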

The detailed data is available here. (Note: the above analysis assumes that non-function-call receipts amount to a negligible volume of compute costs)

Considering these pretty high numbers, it's particularly important that we run this on a mocknet with the transaction relayer before pushing it to production, but AFAICT we always do that for every release anyway, so we should be fine.

@Ekleog-NEAR Ekleog-NEAR added this pull request to the merge queue Sep 12, 2024
Merged via the queue into near:master with commit 52b4684 Sep 12, 2024
28 of 29 checks passed
@Ekleog-NEAR (Collaborator) commented Oct 2, 2024:

I tried running a before/after comparison of this change on a full forknet setup, which takes the state witness size limit into account and not just the compute cost limits, unlike the analysis above.

The first thing to note is that our current forknet infrastructure seems unable to saturate the chain: it apparently pushes transactions too slowly, even when asked for the maximum push speed.

So these numbers should be read as averages rather than as improvements under serious load. I would expect the improvements under serious load to be significantly higher, because as long as blocks are not full, compute cost changes have no impact.

With all this forewarning said, here are the numbers I found. The full runs are available:

I did three runs of forknet:

  • the first run is with the default transaction speed, i.e. the speed at which transactions arrived on mainnet when the traffic was recorded
  • the second run is with 500 ms between each batch of transactions, i.e. around 2x faster than mainnet speed during the recording
  • the third run is with 100 ms between each batch of transactions. It should be around 10x faster than mainnet speed, but looking at the graphs I think something is wrong, and it was probably limited by some other factor, such as the speed of actual transaction sending
  • I also tried a fourth run with 1 ms between each batch of transactions, but the benchmark system crashed early on. I have not investigated exactly why yet, because a similar issue was already being investigated as part of the forknet effort, and backporting to rerun the tests might not make sense considering the 100 ms run already seemed to be bottlenecked on performance

[Screenshots: gas usage per shard, before and after the patch]

Gas usage per shard seems to increase by ~10% with the patch. This would mean a ~10% increased throughput.

However, this looks like a random fluctuation to me, because delayed receipts mostly stay at zero and transaction processing is never limited in either case.


[Screenshots: blocks limited by compute or state witness size, before and after the patch]

We can see here why I think the forknet tooling fails to put the chain under pressure. Especially at 10x the mainnet speed, I would expect to see many more blocks limited by either compute or state witness size than at 2x the mainnet speed. But the numbers are roughly the same, which suggests to me that the transaction replayer can actually replay only at ~2x mainnet speed (or even less).

There is also not much difference here between before and after the change, which is consistent with the chain not actually being under pressure.


[Screenshots: delayed receipts, before and after the patch]

Finally, the delayed receipts might be the most interesting of the graphs, but also the hardest one to interpret. Here we can see:

  • Overall more delayed receipts than before, but also
  • Peaks that dissipate faster than before (6 minutes to recover before the change vs. 3 minutes after)

To be honest, I don't really know how to interpret this, given that the only thing that changed was compute costs. It might all be spurious changes due to random variations in the speed at which the traffic generator node pushed transactions.

However, I will say that all of these graphs are essentially indicative of a mostly healthy chain, as delayed receipts were very limited in number and transaction processing was never limited.


Summary

Altogether, using forknet as a benchmarking system did not actually answer the question of how much performance will change after this code reaches mainnet. The theoretical computation above probably remains the best estimate we can have for now, because forknet is unable to push the chain hard enough to hit the compute cost limits.

Based on my experience using forknet to benchmark, I think we need to:

  1. Make forknet easier to use for benchmarking. This will involve changes to the terraform scripts as well as the mirror tool, so that less domain-specific knowledge is required to run it.
  2. Improve the speed at which forknet can push transactions. This will likely involve improving the performance of the traffic generator node. Only once the traffic generator is able to push more transactions than our system can process will it become a good benchmarking system.

As written in my last weekly sync message, I will start working on this from now on, so that we can get good numbers close to real-world scenarios.
