Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Add a block production benchmark #10104

Conversation

koute
Copy link
Contributor

@koute koute commented Oct 27, 2021

This PR adds a block production benchmark.

Here are the results on my machine (Threadripper 3970X):

Block production/4673 transfers
    time:   [527.28 ms 528.41 ms 529.65 ms]
    thrpt:  [8.8228 Kelem/s 8.8435 Kelem/s 8.8624 Kelem/s]

cc @bkchr Before I go off and try to profile this I'd appreciate if you could take a look and tell me whenever I haven't done anything particularly stupid here.

Fixes #9978

@koute koute added A0-please_review Pull request needs code review. I5-tests Tests need fixing, improving or augmenting. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. D3-trivial 🧸 PR contains trivial changes in a runtime directory that do not require an audit labels Oct 27, 2021
@koute koute requested review from gilescope and bkchr October 27, 2021 06:35
bin/node/cli/benches/block_production.rs Outdated Show resolved Hide resolved
bin/node/cli/benches/block_production.rs Outdated Show resolved Hide resolved

// You should have received a copy of the GNU General Public License
// along with this program. If not, see <https://www.gnu.org/licenses/>.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a small doc on what this benchmark does, and what assumptions it makes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm.... well, I could certainly write something out, but considering how relatively simple the benchmark is (and the fact that I'm not yet super familiar with all of the internals) I'm not entirely sure what extra information I could add with such a comment. (:

Copy link
Member

@bkchr bkchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, ty! Looks good now :)

let max_transfer_count = {
let mut transfer_count = 0;
let mut block_builder = client.new_block(Default::default()).unwrap();
block_builder.push(extrinsic_set_time(1 + 1500)).unwrap();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
block_builder.push(extrinsic_set_time(1 + 1500)).unwrap();
block_builder.push(extrinsic_set_time(1 + MILLISECS_PER_BLOCK)).unwrap();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, right, you're right that I should use a proper constant for this, however this actually can't be MILLISECS_PER_BLOCK. (:

Since BABE isn't actually running and we're using the BlockBuilder directly to author the blocks the current timeslot in BABE doesn't actually progress, so the code fails with a Timestamp slot must match 'CurrentSlot' assertion if we actually set the time to the next block.

So what I'm doing here is incrementing the timestamp only by the MinimumPeriod (which is 1500, half of the
SLOT_DURATION/MILLISECS_PER_BLOCK), so BABE doesn't complain since it's still within the first timeslot.

If we had to actually import more blocks than the first one this would obviously not work, but we don't.

Is this actually a problem here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No should be fine.

b.iter_batched(
|| {
let mut extrinsics = Vec::with_capacity(max_transfer_count + 1);
extrinsics.push(extrinsic_set_time(1 + 1500));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
extrinsics.push(extrinsic_set_time(1 + 1500));
extrinsics.push(extrinsic_set_time(1 + MILLISECS_PER_BLOCK));

use criterion::{criterion_group, criterion_main, BatchSize, Criterion, Throughput};

use node_cli::service::{create_extrinsic, FullClient};
use node_runtime::{constants::currency::*, BalancesCall};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
use node_runtime::{constants::currency::*, BalancesCall};
use node_runtime::{constants::{currency::*, time::MILLISECS_PER_BLOCK}, BalancesCall};

bin/node/cli/benches/block_production.rs Outdated Show resolved Hide resolved
bin/node/cli/benches/block_production.rs Outdated Show resolved Hide resolved
@koute
Copy link
Contributor Author

koute commented Nov 1, 2021

Benchmark result after switching to WASM:

Block production/4673 transfers                                                                          
    time:   [6.4376 s 6.4754 s 6.5188 s]
    thrpt:  [716.85  elem/s 721.65  elem/s 725.89  elem/s]
change:
    time:   [+1133.4% +1152.0% +1169.3%] (p = 0.00 < 0.05)
    thrpt:  [-92.122% -92.012% -91.892%]
    Performance has regressed.

@koute
Copy link
Contributor Author

koute commented Nov 2, 2021

Results with the execution switched to Compiled: (it was on Interpreted before)

Block production/4673 transfers
    time:   [1.3598 s 1.3648 s 1.3697 s]
    thrpt:  [3.4116 Kelem/s 3.4240 Kelem/s 3.4366 Kelem/s]
change:
    time:   [-80.104% -79.971% -79.813%] (p = 0.00 < 0.05)
    thrpt:  [+395.37% +399.28% +402.62%]
Performance has improved.

@bkchr
Copy link
Member

bkchr commented Nov 2, 2021

Hmm, that is much better than I expected :D We may need to use the entire basic authorship machinery, but for the beginning it is probably okay.

@koute
Copy link
Contributor Author

koute commented Nov 5, 2021

Also, mostly just for fun I've checked - bumping wasmtime from 0.30 to 0.31 (from my other PR) increases performance by ~6% (at least on my machine):

Block production/4673 transfers
    time:   [1.2514 s 1.2534 s 1.2556 s]
    thrpt:  [3.7217 Kelem/s 3.7283 Kelem/s 3.7344 Kelem/s]
change:
    time:   [-7.0532% -6.4367% -5.5795%] (p = 0.00 < 0.05)
    thrpt:  [+5.9092% +6.8795% +7.5884%]
    Performance has improved.

Creating all of those extrinsics takes up *a lot* of time, up to the point
where the majority of the time is actually spent *outside* of the code
which we want to benchmark here. So let's only do it once.
@bkchr
Copy link
Member

bkchr commented Nov 5, 2021

Nice :P

I meant we should switch using basic-authorhip, as shown here: https://github.com/paritytech/substrate/tree/master/client/basic-authorship

This makes it more complicated, however I assume that this will slow it down a little bit. If not, our code seems to be good enough performancewise :P

@koute
Copy link
Contributor Author

koute commented Nov 5, 2021

I meant we should switch using basic-authorhip, as shown here: https://github.com/paritytech/substrate/tree/master/client/basic-authorship

Hmm... okay, so how about we add this benchmark mostly as-is, and add another one with that extra machinery included? (I think this benchmark should still be somewhat useful anyway? If nothing else it can be used to easily compare native/interpreted/compiled executions and/or check if new versions of wasmtime have any improvements.) Does that sound good?

@bkchr
Copy link
Member

bkchr commented Nov 5, 2021

Yeah, as already said before, I'm fine with merging as is :)

@bkchr
Copy link
Member

bkchr commented Nov 5, 2021

I'm more "worried" that there are no optimization opportunities xD

@bkchr
Copy link
Member

bkchr commented Nov 5, 2021

Okay, one more thing should be done. Please add a second variant that is building blocks with proof generation enabled. (this is just a setting of the block builder)

@koute
Copy link
Contributor Author

koute commented Nov 9, 2021

Added a variant of the benchmark with proof recording; the difference in performance is minimal:

Block production/4673 transfers (no proof)
                        time:   [1.3151 s 1.3181 s 1.3210 s]
                        thrpt:  [3.5373 Kelem/s 3.5452 Kelem/s 3.5533 Kelem/s]

Block production/4673 transfers (with proof)
                        time:   [1.3597 s 1.3619 s 1.3641 s]
                        thrpt:  [3.4257 Kelem/s 3.4312 Kelem/s 3.4367 Kelem/s]

informant_output_format: Default::default(),
wasm_runtime_overrides: None,
};

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like Configuration should really have a Default impl. Then you can Configuration { impl_name: "BenchmarkImpl".into(), ..Default::default }.
I feel like we don't care about most of these details here.

@bkchr
Copy link
Member

bkchr commented Nov 9, 2021

Added a variant of the benchmark with proof recording; the difference in performance is minimal:

Block production/4673 transfers (no proof)
                        time:   [1.3151 s 1.3181 s 1.3210 s]
                        thrpt:  [3.5373 Kelem/s 3.5452 Kelem/s 3.5533 Kelem/s]

Block production/4673 transfers (with proof)
                        time:   [1.3597 s 1.3619 s 1.3641 s]
                        thrpt:  [3.4257 Kelem/s 3.4312 Kelem/s 3.4367 Kelem/s]

Ahh yeah, we always read the same 2 positions. So, this is cached :P But let us merge this and then in a follow up we should try to use unique accounts for sending and receiving, for every tx.

@koute
Copy link
Contributor Author

koute commented Nov 9, 2021

Sounds good to me; I'll add the unique accounts in a separate PR.

@koute
Copy link
Contributor Author

koute commented Nov 9, 2021

bot merge

@paritytech-processbot paritytech-processbot bot merged commit 800fac1 into paritytech:master Nov 9, 2021
grishasobol pushed a commit to gear-tech/substrate that referenced this pull request Mar 28, 2022
* Add a block production benchmark

* Simplify the block production benchmark

* Cleanups; switch execution strategy to WASM

* Switch WASM execution to `Compiled`

* Reduce the setup cost of the benchmark

Creating all of those extrinsics takes up *a lot* of time, up to the point
where the majority of the time is actually spent *outside* of the code
which we want to benchmark here. So let's only do it once.

* Add a variant of the block production benchmark with proof recording
ark0f pushed a commit to gear-tech/substrate that referenced this pull request Feb 27, 2023
* Add a block production benchmark

* Simplify the block production benchmark

* Cleanups; switch execution strategy to WASM

* Switch WASM execution to `Compiled`

* Reduce the setup cost of the benchmark

Creating all of those extrinsics takes up *a lot* of time, up to the point
where the majority of the time is actually spent *outside* of the code
which we want to benchmark here. So let's only do it once.

* Add a variant of the block production benchmark with proof recording
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. D3-trivial 🧸 PR contains trivial changes in a runtime directory that do not require an audit I5-tests Tests need fixing, improving or augmenting.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create block production benchmark
4 participants