Find Optimizations in the current Clarity VM via benchmarking current VM and Wasm #4316
Summary of differences from costs-3 to costs-4:

| Symbol | Meaning |
|---|---|
| + | Minor cost increase |
| ++ | Cost increase |
| +++ | Major cost increase |
| - | Minor cost decrease |
| -- | Cost decrease |
| --- | Major cost decrease |
Costs
Only the functions that have different cost values are reported here:
| Function | costs-3 | costs-4 | Interpretation |
|---|---|---|---|
| cost_analysis_type_annotate | (runtime (linear n u1 u9)) | (runtime (linear n u1 u10)) | + |
| cost_analysis_type_check | (runtime (linear n u113 u1)) | (runtime (linear n u115 u0)) | ? |
| cost_analysis_list_items_check | (runtime (linear n u2 u4)) | (runtime (linear n u2 u8)) | + |
| cost_analysis_check_tuple_cons | (runtime (nlogn n u3 u5)) | (runtime (nlogn n u3 u142)) | +++ |
| cost_analysis_tuple_items_check | (runtime (linear n u1 u28)) | (runtime (linear n u2 u52)) | ++ |
| cost_analysis_lookup_variable_const | (runtime u15) | (runtime u16) | + |
| cost_ast_cycle_detection | (runtime (linear n u141 u72)) | (runtime (linear n u156 u23)) | ? |
| cost_analysis_fetch_contract_entry | read_length: (linear n u1 u1) | read_count: (linear n u1 u1) | Bug? read_count duplicated |
| cost_lookup_variable_size | (runtime (linear n u2 u1)) | (runtime (linear n u2 u18)) | ++ |
| cost_user_function_application | (runtime (linear n u26 u5)) | (runtime (linear n u27 u5)) | + |
| cost_tuple_merge | (runtime (linear n u4 u408)) | (runtime (linear n u5 u408)) | + |
| cost_xor | (runtime (linear n u15 u129)) | (runtime u139) | - |
| cost_eq | (runtime (linear n u7 u151)) | (runtime (linear n u8 u151)) | + |
| cost_sha256 | (runtime (linear n u1 u100)) | (runtime (linear n u1 u180)) | ++ |
| cost_sha512 | (runtime (linear n u1 u176)) | (runtime (linear n u1 u216)) | ++ |
| cost_as_max_len | (runtime u475) | (runtime u518) | + |
| cost_contract_of | (runtime u13400) | (runtime u19002) | ++ |
| cost_block_info | runtime: u6321 | runtime: u7841 | ++ |
| cost_stx_balance | runtime: u4294 | runtime: u1712 | -- |
| cost_stx_transfer | runtime: u4640 | runtime: u1672 | -- |
| cost_nft_owner | runtime: (linear n u9 u795) | runtime: (linear n u10 u741) | ? |
| cost_burn_block_info | runtime: u96479 | runtime: u105877 | + |
| cost_stx_account | runtime: u4654 | runtime: u2021 | -- |
| cost_stx_transfer_memo | runtime: u4709 | runtime: u1761 | -- |
| cost_bitwise_and | (runtime (linear n u15 u129)) | No output | |
| cost_bitwise_or | (runtime (linear n u15 u129)) | No output | |
| cost_bitwise_not | (runtime u147) | No output | |
| cost_bitwise_left_shift | (runtime u167) | No output | |
| cost_bitwise_right_shift | (runtime u167) | No output | |
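For reference, a cost expression like `(linear n a b)` evaluates to `a*n + b`, and `(nlogn n a b)` to roughly `a*n*log2(n) + b`, where `n` is the input size. Here is a minimal sketch of an evaluator for these shapes, to make the table concrete (hypothetical helper code, not the actual stacks-core implementation):

```rust
// Hypothetical evaluator for the cost expressions in the table above. The
// real cost functions live in stacks-core's clarity crate; this only
// illustrates how to read an entry like `(runtime (linear n u113 u1))`.

#[derive(Debug, Clone, Copy)]
enum CostExpr {
    /// `(runtime uK)`: a constant cost.
    Constant(u64),
    /// `(linear n a b)`: a*n + b.
    Linear { a: u64, b: u64 },
    /// `(nlogn n a b)`: a*n*log2(n) + b.
    NLogN { a: u64, b: u64 },
}

impl CostExpr {
    fn eval(&self, n: u64) -> u64 {
        match *self {
            CostExpr::Constant(k) => k,
            CostExpr::Linear { a, b } => a.saturating_mul(n).saturating_add(b),
            CostExpr::NLogN { a, b } => {
                // floor(log2(n)), treating n <= 1 as 0.
                let log_n = if n <= 1 { 0 } else { 63 - n.leading_zeros() as u64 };
                a.saturating_mul(n).saturating_mul(log_n).saturating_add(b)
            }
        }
    }
}

fn main() {
    // cost_analysis_type_check: the "?" above reflects that the slope rose
    // while the intercept fell, so the two curves must be compared at a
    // concrete n.
    let costs_3 = CostExpr::Linear { a: 113, b: 1 };
    let costs_4 = CostExpr::Linear { a: 115, b: 0 };
    println!("n=100: costs-3 = {}, costs-4 = {}", costs_3.eval(100), costs_4.eval(100));
}
```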
Analysis
Many (most?) costs have not changed between costs-3 and costs-4. For those that have changed, most show an increase in cost. A few, however, notably the stx_* functions dealing with STX accounts and balances, show a significant decrease.

Tweaking the read limits to run another round of benchmarking.
Here are my thoughts about changes to the mechanism for setting the cost limits in Nakamoto that I believe require some discussion.

In the previous cost-limit calculations, the goal was to determine a reasonable limit on the various cost components that would allow reasonable hardware to process a block in 30 seconds. The 30-second time limit was chosen because ~95% of bitcoin blocks take longer than 30 seconds to arrive. In practice, that means a 20-minute block is limited to the cost budget of a 30-second block -- an overly cautious and pessimistic situation.

This difficulty in defining a limit that handles the majority of cases can be improved upon in Nakamoto thanks to the tenure-extend functionality. Thus far, tenure extension has been discussed as refreshing the block budget after 10 minutes, to better handle long block times (~37% of blocks take longer than 10 minutes). My proposal is that we could get much better results if, instead, we extend the tenure in smaller chunks of time, so that the budget more closely matches the actual elapsed time. This keeps the limits tight enough that slower nodes can keep up with the network, without being overly cautious and penalizing the network for the ~95% of blocks that take longer than 30 seconds.

The proposal, then, is to continue to choose the limits based on the 30-second block time, but issue tenure extensions more often. If tenure extensions were issued every 30 seconds, matching the budget limits, we would end up in a case where slower nodes cannot keep up; but if they are issued every minute, the budget still accomplishes its goal while also 10x-ing the budget for an average 10-minute block.

The other advantage of this strategy is that it further encourages miners to continue mining blocks throughout their entire tenure. With the current strategy, a fast miner with a long bitcoin block could spend its entire budget very quickly, then be unable to mine any more blocks until 10 minutes have passed and it processes a tenure extension to refresh its budget. This is bad for transaction latency. With this proposed model, the fast miner gets a fresh budget every minute.

I don't believe this proposal should cause any new work or new problems, since we already planned to support tenure extensions, and the frequency of these extensions (one per minute) is still low enough that it should not cause any problems with network congestion or other similar concerns.

The one downside to maintaining a block limit based on a 30-second block is that it limits the cost of individual transactions. For example, a budget based on 10-minute blocks would allow a single transaction that used the entire budget to be 20x larger than a single transaction using the entire 30-second budget. In practice, the 30-second budget I am proposing will already be larger than the current budget, and thus far it has been feasible to structure all useful transactions so that they fit into current blocks.
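To make the arithmetic of the proposal concrete, here is a toy model (the function name and the budget unit are hypothetical, not the stacks-core API):

```rust
/// Toy model of the proposal above (all names and units are hypothetical):
/// with a base budget sized for a 30-second block and a tenure extension
/// issued every `extension_interval_secs`, the budget available over a
/// tenure scales with the tenure's actual length.
fn total_tenure_budget(base_budget: u64, extension_interval_secs: u64, tenure_secs: u64) -> u64 {
    // One initial budget, plus one refresh per completed extension interval.
    let refreshes = tenure_secs / extension_interval_secs;
    base_budget * (1 + refreshes)
}

fn main() {
    let base = 100; // arbitrary units for a 30s budget
    // 10-minute tenure with 60s extensions: 10 refreshes -> 11x the base
    // budget, roughly the "10x-ing" described above.
    assert_eq!(total_tenure_budget(base, 60, 600), 1_100);
    // Same tenure with a single 10-minute refresh: only 2x the base budget.
    assert_eq!(total_tenure_budget(base, 600, 600), 200);
}
```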
The other major concern is initial boot-up time. Processing a block should be considerably faster (e.g., by orders of magnitude) than mining it; otherwise, it could take months or even years for a node to catch up with the chain tip.
If you make the tenure budget too small, then certain kinds of transactions that are valid today and have been mined today would cease to be mineable. There are some today that can take double-digit percentages of a block.
We can do this without issuing tenure extensions and hitting the above problem: signers throttle miners based on the rate of resource consumption. Signers can track how much of each dimension the miners use, and if they exceed a rate of
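A minimal sketch of what such rate-based throttling could look like on the signer side (types and thresholds are invented for illustration; this is not the actual signer logic):

```rust
/// Hypothetical sketch of signer-side throttling: track a miner's resource
/// consumption over its tenure and refuse to sign blocks that would push
/// the consumption *rate* over a cap. Only the runtime dimension is shown;
/// read/write counts and lengths would be tracked the same way.
struct ConsumptionTracker {
    runtime_used: u64,
    tenure_start_secs: u64,
}

impl ConsumptionTracker {
    /// Would signing a block consuming `block_runtime` at `now_secs`
    /// exceed `max_runtime_per_sec`?
    fn would_exceed_rate(&self, block_runtime: u64, now_secs: u64, max_runtime_per_sec: u64) -> bool {
        let elapsed_secs = now_secs.saturating_sub(self.tenure_start_secs).max(1);
        (self.runtime_used + block_runtime) / elapsed_secs > max_runtime_per_sec
    }

    fn record(&mut self, block_runtime: u64) {
        self.runtime_used += block_runtime;
    }
}

fn main() {
    let mut tracker = ConsumptionTracker { runtime_used: 0, tenure_start_secs: 0 };
    // 30s into the tenure, a block consuming 5M runtime units arrives;
    // with a cap of 1M units/sec it is within the rate, so record it.
    if !tracker.would_exceed_rate(5_000_000, 30, 1_000_000) {
        tracker.record(5_000_000);
    }
    assert_eq!(tracker.runtime_used, 5_000_000);
}
```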
There is overhead in selecting transactions to include in the block, but the actual execution of the transactions should be identical between mining and processing, no?
I addressed this in the last paragraph. The 30s budget that we choose would be greater than the current budget, so this should not be a problem. I am not suggesting we shrink it further, for example down to 5s limits, as might otherwise seem a logical step.
I agree that we could do throttling this way, but that requires separate logic in the signer, while my proposal doesn't require any additional logic beyond what is already needed for the 10-minute tenure extension. Is there another benefit to the approach you've described that makes it the better option?
In regards to profiling, just as an FYI since this issue seems relatively generic: I'm currently working on some node db/storage/serialization benchmarking as a result of performance issues I've come across in the a/b tester, particularly the need to replay chainstate multiple times during development, which takes far too much time as it is now. For anybody else doing profiling/benchmarking, I've also got a branch with tracing implemented with custom layers, which I've been using to time and visualize the call chain more easily, if that's of interest to anybody.
Some more ideas for optimization:
I am going to start working on a PR tomorrow for updating the storage and retrieval of contracts + analysis. According to benchmarks, contract storage and read performance is greatly improved by storing contract data in a single table instead of as 5+ key-value entries. Also, using MessagePack serialization (instead of JSON) + LZ4 compression for the contract AST + analysis is almost twice as fast (and half the size) as the current solution.
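For illustration, a minimal sketch of the MessagePack + LZ4 pipeline described above, using the `rmp-serde` and `lz4_flex` crates with `serde` derive; `ContractData` here is a stand-in, not the actual stacks-core type:

```rust
use serde::{Deserialize, Serialize};

/// Stand-in for the contract AST + analysis data; the real types live in
/// stacks-core's clarity crate.
#[derive(Serialize, Deserialize, Debug, PartialEq)]
struct ContractData {
    ast: String,
    analysis: String,
}

/// Serialize with MessagePack, then compress with LZ4 (length-prefixed so
/// decompression knows the output size).
fn to_storage_bytes(data: &ContractData) -> Result<Vec<u8>, rmp_serde::encode::Error> {
    let packed = rmp_serde::to_vec(data)?;
    Ok(lz4_flex::compress_prepend_size(&packed))
}

fn from_storage_bytes(bytes: &[u8]) -> Option<ContractData> {
    let packed = lz4_flex::decompress_size_prepended(bytes).ok()?;
    rmp_serde::from_slice(&packed).ok()
}

fn main() {
    let data = ContractData {
        ast: "(define-read-only (f) u1)".into(),
        analysis: "...".into(),
    };
    let bytes = to_storage_bytes(&data).expect("serialization failed");
    assert_eq!(from_storage_bytes(&bytes), Some(data));
}
```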
Just FYI, messagepack is not 1.0 yet, so I'd shy away from using it. Things could break. Also, I'd recommend |
I've used it in past projects with great success, and the current spec has been stable for seven years. As long as we don't randomly upgrade the dependency between spec changes without requiring node resyncs, it will be fine :) The arguably "better" alternative would of course be protobuf, but I personally think that the added complexity of protobuf could come in a later iteration, if even more performance needs to be squeezed out. The really nice thing with
I had actually missed that library when I looked at these benchmarks, but according to those, I'll create two separate PRs (one for DB and one for serialization) so that the topics can be discussed separately :)
The arguably "better" alternative is to just bite the bullet and use You may have had success in using
When it comes to blockchains -- that is, systems which not only manage other people's money, but also offer exactly zero recourse in the event of a remotely-triggered memory corruption bug -- safety is infinitely more important than speed. CVEs like this are a very compelling reason to stay far, far away from packages that link
I disagree. And I'm not just trying to be pedantic here, either. I think it's actually important to distinguish between serializations that are consensus-critical and those that are not, because they have genuinely different requirements. Consensus-critical serializations cannot change between versions without forking logic (and of course, a fork), so their implementations must be made cognizant of that. Non-consensus-critical serializations do not require this property (or at least do not face anywhere near the same consequences for it). They have other important properties, like memory safety, but those are actually different properties, and I think they could lead to other implementations that prioritize speed, overhead, and maintainability. Concerns about protobuf (or whatever else) may be valid with respect to consensus-critical serializations, but totally invalid when applied to non-consensus-critical serializations. Because it's impossible to easily distinguish between which
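To illustrate the distinction, here is a hedged sketch of how the two kinds of serialization could be kept apart at the type level (the trait names are invented for this sketch, not stacks-core's actual codec traits):

```rust
// Illustrative only: trait names here are invented for this sketch and are
// not stacks-core's actual codec traits.

/// Consensus-critical byte representation: must never change without a
/// coordinated fork, so implementations are hand-written and version-aware.
trait ConsensusCodec: Sized {
    fn consensus_serialize(&self) -> Vec<u8>;
    fn consensus_deserialize(bytes: &[u8]) -> Option<Self>;
}

/// Local storage representation: each node may pick any format (JSON,
/// MessagePack, ...) so long as round-tripping preserves the value, and it
/// can be swapped for speed without a fork.
trait StorageCodec: Sized {
    fn storage_serialize(&self) -> Vec<u8>;
    fn storage_deserialize(bytes: &[u8]) -> Option<Self>;
}

struct BlockHeight(u64);

impl ConsensusCodec for BlockHeight {
    fn consensus_serialize(&self) -> Vec<u8> {
        self.0.to_be_bytes().to_vec() // fixed big-endian layout, never changes
    }
    fn consensus_deserialize(bytes: &[u8]) -> Option<Self> {
        Some(BlockHeight(u64::from_be_bytes(bytes.try_into().ok()?)))
    }
}

fn main() {
    let h = BlockHeight(123);
    let bytes = h.consensus_serialize();
    assert!(BlockHeight::consensus_deserialize(&bytes).is_some());
}
```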
Sure, those are all fair points, and I agree on the top-level bit that we should distinguish between data structures that have consensus-critical byte representations (begetting consensus-critical codecs) and those that do not. I'd further differentiate between structures whose non-consensus-critical byte representations can be upgraded on a node-by-node basis (e.g. the DB representation) versus those which still require a network upgrade to roll out (e.g. p2p message packet representations). The point I was trying to make was less about forcing everything to use ...

Is there a discussion somewhere on which structures can have non-consensus-critical DB representations in Clarity?
I'm (obviously) in agreement with @kantal here. It would take a lot of (unnecessary) effort to write a correct and bug-free (de)serialization implementation using ... Regarding consensus-critical serialization, however, it frankly surprises me that the project isn't using a serialization framework like Protobuf, which uses well-defined, versioned schemas and generated message contracts.
I didn't say that I've used MessagePack in personal projects - we've used it in mission-critical, event-driven applications processing millions of transactions, and ~$1bn worth, per annum without [serialization] issues, which is one of the reasons it's at the top of my list. If you research it, you will also find that it's one of the most common binary serialization formats, next to BSON, Avro, Protobuf, etc. As @kantal points out, I'm talking about a non-consensus-critical code path. As long as consensus-critical data can be re-hydrated to its consensus-critical representation, how it's persisted is a node implementation detail. Why I advocate for MessagePack over Protobuf for this path is that I don't believe the benefits outweigh the added overhead of maintaining schemas and code generation when it's not strictly needed - unless we need to save every byte and CPU cycle we can.
I'm aware of the security concerns and constraints of a blockchain - I'm just more accustomed to such things being handled using process and objective risk assessments. I have difficulty accepting past CVEs as a blanket argument because, well, it feels slightly hypocritical given the project's current dependencies... (just a quick search):

... and not a reliable predictor of the future. There are always going to be more bugs and more CVEs, even close to home. I've seen a number of critical home-grown bugs in the last year requiring node patching and/or forks; I can't say the same for issues with dependencies, but please correct me if I'm mistaken? When it comes to ... I'm not trying to say that new dependencies should just be allowed in "hey-ho" 🤠 - I'm arguing for objective risk management.
Not that I've seen, but in general anything stored in any of the SQLite databases can be physically represented in any format, given the constraint that it can be re-hydrated to its consensus-critical format (right now I'm simply assuming that whatever goes into the persistence layer must come out the same way, to keep changes minimal). The two primary contenders are the contract AST and contract analysis, which are JSON-serialized and stored directly in SQLite in Clarity's ...

Anyway, I feel like we've kind-of hijacked this issue... should we move it to a discussion thread instead?
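That re-hydration constraint can be stated as a simple round-trip invariant. A self-contained sketch (the codec closures are placeholders for whatever formats a node chooses; none of this is stacks-core API):

```rust
/// The constraint described above as a round-trip check: whatever goes into
/// the persistence layer must come back out with the same consensus-critical
/// byte representation. Codec choices are passed in as closures so the
/// storage format (JSON, MessagePack + LZ4, ...) stays a swappable detail.
fn storage_preserves_consensus<T>(
    value: &T,
    storage_encode: impl Fn(&T) -> Vec<u8>,
    storage_decode: impl Fn(&[u8]) -> Option<T>,
    consensus_encode: impl Fn(&T) -> Vec<u8>,
) -> bool {
    match storage_decode(&storage_encode(value)) {
        Some(rehydrated) => consensus_encode(&rehydrated) == consensus_encode(value),
        None => false,
    }
}

fn main() {
    // Toy example: store a u64 as decimal text, while its consensus format
    // is fixed big-endian bytes.
    let ok = storage_preserves_consensus(
        &42u64,
        |v| v.to_string().into_bytes(),
        |b| std::str::from_utf8(b).ok()?.parse().ok(),
        |v| v.to_be_bytes().to_vec(),
    );
    assert!(ok);
}
```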
That's fair, but I think it's important to consider the blast radius of such changes - in a distributed system, a minor bug can (and does, as we've seen in the recent past) have an extremely large blast radius that affects nearly everyone using the chain. I think the unknown here is separating between consensus-critical and non-consensus-critical, and whether that can be done reliably and predictably (particularly when adding new functions or refactoring/maintaining existing code). It's possible, but my concern is missing something in that refactor that is only apparent once it's released and thus much harder to fix.
I'll admit I'm not too familiar with the differences between the packages you're talking about here, but one thing I have seen in the past is that using an external package definitely comes with increased risk, and it's not entirely clear what the benefit is (perhaps you can share more about the expected benefits in a new discussion?).
This is, I think, the most important aspect of what you're proposing - how would objective risk management work for proposed changes as described here? Is that still unknown today, or do you have some ideas you're saving for a discussion post?
Please! I think this is a good discussion to have, and there are a lot of viewpoints that we should consider versus the limited visibility in this specific issue. I'm particularly interested in the risk management you've mentioned, and whether there are any ideas on how that could be accomplished for a decentralized project. So I would recommend that we move this to a discussion (so we no longer have this issue hijacked), with the expectation that there may be little action until after the 2nd proposed hard fork later this year.
To avoid further contaminating this issue, I'll summarize all of the above in a separate GitHub discussion when I have some time, and we can continue there 😊
Performance Update

Here is the total effect on block processing time of all the optimization PRs I've done so far (including #4408, which is not merged yet. Please review!):

```console
❯ hyperfine -w 3 -r 20 "stacks-core/target/release/stacks-inspect.7e5146511 replay-block /home/jbencin/data/next/ range 99990 100000" "stacks-core/target/release/stacks-inspect.all-optimizations replay-block /home/jbencin/data/next/ range 99990 100000"
Benchmark 1: stacks-core/target/release/stacks-inspect.7e5146511 replay-block /home/jbencin/data/next/ range 99990 100000
  Time (mean ± σ):     12.164 s ±  0.049 s    [User: 11.641 s, System: 0.472 s]
  Range (min … max):   12.102 s … 12.324 s    20 runs

Benchmark 2: stacks-core/target/release/stacks-inspect.all-optimizations replay-block /home/jbencin/data/next/ range 99990 100000
  Time (mean ± σ):      8.789 s ±  0.035 s    [User: 8.309 s, System: 0.434 s]
  Range (min … max):    8.728 s …  8.901 s    20 runs

Summary
  stacks-core/target/release/stacks-inspect.all-optimizations replay-block /home/jbencin/data/next/ range 99990 100000 ran
    1.38 ± 0.01 times faster than stacks-core/target/release/stacks-inspect.7e5146511 replay-block /home/jbencin/data/next/ range 99990 100000
```

This benchmark shows a 38% improvement in processing speed over an arbitrary range of blocks. Since the build changes had by far the largest effect, we will probably see significant improvements in other areas, like API performance, as well.
Could you try this over, say, 1000 blocks, and ideally closer to block 135k? Might take a couple hours, but it'll be a much better sample and thus indication of the performance gains. I think 135k should cover the introduction of ordinals...
Eventually, but that would take a long time. I've been using a benchmark that is reasonably fast so that I can quickly test my changes.

I will try this on the most recent blocks now that consensus bugs in ...
@cylewitruk here's the performance on block range 145,000-145,050 with all current PRs (everything used in the previous comment, plus #4425):

```console
❯ hyperfine -w 3 -r 10 "stacks-core/target/release/stacks-inspect.7e5146511 replay-block /home/jbencin/data/next/ range 145000 145050" "stacks-core/target/release/stacks-inspect.all-optimizations replay-block /home/jbencin/data/next/ range 145000 145050"
Benchmark 1: stacks-core/target/release/stacks-inspect.7e5146511 replay-block /home/jbencin/data/next/ range 145000 145050
  Time (mean ± σ):    253.175 s ±  1.004 s    [User: 245.166 s, System: 6.936 s]
  Range (min … max):  250.886 s … 254.416 s    10 runs

Benchmark 2: stacks-core/target/release/stacks-inspect.all-optimizations replay-block /home/jbencin/data/next/ range 145000 145050
  Time (mean ± σ):    178.373 s ±  0.311 s    [User: 170.661 s, System: 6.766 s]
  Range (min … max):  178.091 s … 178.977 s    10 runs

Summary
  stacks-core/target/release/stacks-inspect.all-optimizations replay-block /home/jbencin/data/next/ range 145000 145050 ran
    1.42 ± 0.01 times faster than stacks-core/target/release/stacks-inspect.7e5146511 replay-block /home/jbencin/data/next/ range 145000 145050
```

This is almost exactly what I see with these binaries on the previous block range (99,990-100,000):

```console
❯ hyperfine -w 3 -r 10 "stacks-core/target/release/stacks-inspect.7e5146511 replay-block /home/jbencin/data/next/ range 99990 100000" "stacks-core/target/release/stacks-inspect.all-optimizations replay-block /home/jbencin/data/next/ range 99990 100000"
Benchmark 1: stacks-core/target/release/stacks-inspect.7e5146511 replay-block /home/jbencin/data/next/ range 99990 100000
  Time (mean ± σ):     12.160 s ±  0.030 s    [User: 11.623 s, System: 0.490 s]
  Range (min … max):   12.106 s … 12.200 s    10 runs

Benchmark 2: stacks-core/target/release/stacks-inspect.all-optimizations replay-block /home/jbencin/data/next/ range 99990 100000
  Time (mean ± σ):      8.594 s ±  0.009 s    [User: 8.110 s, System: 0.441 s]
  Range (min … max):    8.582 s …  8.606 s    10 runs

Summary
  stacks-core/target/release/stacks-inspect.all-optimizations replay-block /home/jbencin/data/next/ range 99990 100000 ran
    1.41 ± 0.00 times faster than stacks-core/target/release/stacks-inspect.7e5146511 replay-block /home/jbencin/data/next/ range 99990 100000
```

So we've got some nice performance gains here that are consistent across modern Stacks blocks (as opposed to the much smaller gains we saw on the 0-3,000 range).
Very nice! Promising! 😄 I still do think it would be wise to run an overnight benchmark on a larger block range, simply because 10/50 blocks (~1½ / ~8 hours) are statistically insignificant sample windows - those block ranges could be (not saying they are) coincidentally misrepresentative, so it would be good to get averages over 24-48 hours (150-300 blocks) or more, so that we know we're covering multiple time zones, days of the week, etc., i.e. different behavioral patterns.