Add a block production benchmark #10104

koute · 2021-10-27T06:35:04Z

This PR adds a block production benchmark.

Here are the results on my machine (Threadripper 3970X):

Block production/4673 transfers
    time:   [527.28 ms 528.41 ms 529.65 ms]
    thrpt:  [8.8228 Kelem/s 8.8435 Kelem/s 8.8624 Kelem/s]
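
For reference, the thrpt line follows from the time line: the element count (here the 4673 transfers) divided by the measured time. A minimal sketch of that arithmetic, using the median time from the run above; the `throughput` helper is hypothetical and not part of the PR:

```rust
/// Elements processed per second, given an element count and a
/// wall-clock time in seconds.
fn throughput(elements: u64, seconds: f64) -> f64 {
    elements as f64 / seconds
}

fn main() {
    // Median time from the run above: 528.41 ms for 4673 transfers.
    let per_second = throughput(4673, 0.52841);
    println!("{:.1} transfers/s", per_second);
}
```

The result lands at roughly 8.84 thousand transfers per second, matching the reported median of 8.8435 Kelem/s.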

cc @bkchr Before I go off and try to profile this I'd appreciate it if you could take a look and tell me whether I haven't done anything particularly stupid here.

Fixes #9978

shawntabrizi · 2021-10-31T10:42:27Z

bin/node/cli/benches/block_production.rs

+
+// You should have received a copy of the GNU General Public License
+// along with this program. If not, see <https://www.gnu.org/licenses/>.
+


Maybe a small doc on what this benchmark does, and what assumptions it makes?

koute · 2021-11-01

Hmm... well, I could certainly write something up, but considering how simple the benchmark is (and the fact that I'm not yet super familiar with all of the internals), I'm not entirely sure what extra information such a comment would add. (:

bkchr

Nice, ty! Looks good now :)

bkchr · 2021-11-01T11:47:24Z

bin/node/cli/benches/block_production.rs

+	let max_transfer_count = {
+		let mut transfer_count = 0;
+		let mut block_builder = client.new_block(Default::default()).unwrap();
+		block_builder.push(extrinsic_set_time(1 + 1500)).unwrap();


Suggested change

-		block_builder.push(extrinsic_set_time(1 + 1500)).unwrap();
+		block_builder.push(extrinsic_set_time(1 + MILLISECS_PER_BLOCK)).unwrap();

koute · 2021-11-01

Ah, you're right that I should use a proper constant for this; however, it can't actually be MILLISECS_PER_BLOCK. (:

Since BABE isn't actually running and we're using the BlockBuilder directly to author the blocks, the current timeslot in BABE doesn't progress, so the code fails with a "Timestamp slot must match 'CurrentSlot'" assertion if we set the time to that of the next block.

So what I'm doing here is incrementing the timestamp only by the MinimumPeriod (which is 1500, half of SLOT_DURATION/MILLISECS_PER_BLOCK), so BABE doesn't complain since it's still within the first timeslot.

If we had to import more blocks than the first one, this would obviously not work, but we don't.

Is this actually a problem here?

bkchr · 2021-11-09

No, should be fine.
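
The timestamp reasoning in this thread can be sketched as a simplified model. This is not the actual BABE or timestamp pallet code: `slot_for_timestamp` and `timestamp_matches_slot` are hypothetical helpers, the constants are the values quoted in the thread, and the sketch assumes the current slot stays at the first one because BABE is not running:

```rust
// Values as stated in the thread: MinimumPeriod is 1500 ms, which is half
// of SLOT_DURATION / MILLISECS_PER_BLOCK.
const MINIMUM_PERIOD: u64 = 1500;
const SLOT_DURATION: u64 = 2 * MINIMUM_PERIOD;

/// Hypothetical model of the slot that a timestamp falls into.
fn slot_for_timestamp(timestamp_ms: u64) -> u64 {
    timestamp_ms / SLOT_DURATION
}

/// Hypothetical stand-in for the "Timestamp slot must match 'CurrentSlot'" check.
fn timestamp_matches_slot(timestamp_ms: u64, current_slot: u64) -> bool {
    slot_for_timestamp(timestamp_ms) == current_slot
}

fn main() {
    // Assumption: without BABE running, the current slot never advances.
    let current_slot = slot_for_timestamp(1);

    // Advancing by MinimumPeriod stays inside the first slot, so the check passes.
    assert!(timestamp_matches_slot(1 + MINIMUM_PERIOD, current_slot));

    // Advancing by a full block time lands in the next slot, so the check fails.
    assert!(!timestamp_matches_slot(1 + SLOT_DURATION, current_slot));
}
```

Under these assumptions, only a single block can be authored this way, which is all the benchmark needs.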

bkchr · 2021-11-01T11:47:42Z

bin/node/cli/benches/block_production.rs

+		b.iter_batched(
+			|| {
+				let mut extrinsics = Vec::with_capacity(max_transfer_count + 1);
+				extrinsics.push(extrinsic_set_time(1 + 1500));


Suggested change

-				extrinsics.push(extrinsic_set_time(1 + 1500));
+				extrinsics.push(extrinsic_set_time(1 + MILLISECS_PER_BLOCK));

bkchr · 2021-11-01T11:48:31Z

bin/node/cli/benches/block_production.rs

+use criterion::{criterion_group, criterion_main, BatchSize, Criterion, Throughput};
+
+use node_cli::service::{create_extrinsic, FullClient};
+use node_runtime::{constants::currency::*, BalancesCall};


Suggested change

-use node_runtime::{constants::currency::*, BalancesCall};
+use node_runtime::{constants::{currency::*, time::MILLISECS_PER_BLOCK}, BalancesCall};

koute · 2021-11-01T13:08:13Z

Benchmark result after switching to WASM:

Block production/4673 transfers
    time:   [6.4376 s 6.4754 s 6.5188 s]
    thrpt:  [716.85  elem/s 721.65  elem/s 725.89  elem/s]
change:
    time:   [+1133.4% +1152.0% +1169.3%] (p = 0.00 < 0.05)
    thrpt:  [-92.122% -92.012% -91.892%]
    Performance has regressed.

koute · 2021-11-02T11:43:40Z

Results with the execution switched to Compiled (it was on Interpreted before):

Block production/4673 transfers
    time:   [1.3598 s 1.3648 s 1.3697 s]
    thrpt:  [3.4116 Kelem/s 3.4240 Kelem/s 3.4366 Kelem/s]
change:
    time:   [-80.104% -79.971% -79.813%] (p = 0.00 < 0.05)
    thrpt:  [+395.37% +399.28% +402.62%]
    Performance has improved.

bkchr · 2021-11-02T21:16:32Z

Hmm, that is much better than I expected :D We may need to use the entire basic authorship machinery, but for the beginning it is probably okay.

koute · 2021-11-05T06:28:04Z

Also, mostly just for fun I've checked - bumping wasmtime from 0.30 to 0.31 (from my other PR) increases performance by ~6% (at least on my machine):

Block production/4673 transfers
    time:   [1.2514 s 1.2534 s 1.2556 s]
    thrpt:  [3.7217 Kelem/s 3.7283 Kelem/s 3.7344 Kelem/s]
change:
    time:   [-7.0532% -6.4367% -5.5795%] (p = 0.00 < 0.05)
    thrpt:  [+5.9092% +6.8795% +7.5884%]
    Performance has improved.

Creating all of those extrinsics takes up *a lot* of time, up to the point where the majority of the time is actually spent *outside* of the code which we want to benchmark here. So let's only do it once.
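
The idea of keeping setup out of the measured region, which the diff above does through criterion's `iter_batched`, can be sketched with the standard library alone. `time_batched` below is a hypothetical stand-in rather than criterion's implementation, and the vector of integers stands in for the pre-built extrinsics:

```rust
use std::time::{Duration, Instant};

/// Hypothetical stand-in for a batched benchmark loop: `setup` runs before
/// the clock starts, so only `routine` contributes to the returned duration.
fn time_batched<I, O>(
    iterations: u32,
    mut setup: impl FnMut() -> I,
    mut routine: impl FnMut(I) -> O,
) -> Duration {
    let mut total = Duration::ZERO;
    for _ in 0..iterations {
        let input = setup(); // not timed
        let start = Instant::now();
        let output = routine(input); // timed
        total += start.elapsed();
        drop(output); // dropping the output is not timed either
    }
    total
}

fn main() {
    // Build the expensive inputs once (stand-in for creating the extrinsics)...
    let extrinsics: Vec<u64> = (0..4673).collect();
    // ...and only clone them in the per-iteration setup.
    let elapsed = time_batched(
        10,
        || extrinsics.clone(),
        |batch: Vec<u64>| batch.iter().sum::<u64>(),
    );
    println!("time spent in the measured routine: {:?}", elapsed);
}
```

Cloning prepared inputs in the setup closure keeps the per-iteration cost small and, more importantly, outside the timed section.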

bkchr · 2021-11-05T12:34:48Z

Nice :P

I meant we should switch to using basic-authorship, as shown here: https://github.com/paritytech/substrate/tree/master/client/basic-authorship

This makes it more complicated; however, I assume that this will slow it down a little bit. If not, our code seems to be good enough performance-wise :P

koute · 2021-11-05T12:49:10Z

> I meant we should switch to using basic-authorship, as shown here: https://github.com/paritytech/substrate/tree/master/client/basic-authorship

Hmm... okay, so how about we add this benchmark mostly as-is, and add another one with that extra machinery included? (I think this benchmark should still be somewhat useful anyway? If nothing else it can be used to easily compare native/interpreted/compiled executions and/or check if new versions of wasmtime have any improvements.) Does that sound good?

bkchr · 2021-11-05T14:19:43Z

Yeah, as already said before, I'm fine with merging as is :)

bkchr · 2021-11-05T14:20:03Z

I'm more "worried" that there are no optimization opportunities xD

bkchr · 2021-11-05T17:35:55Z

Okay, one more thing should be done. Please add a second variant that builds blocks with proof generation enabled (this is just a setting of the block builder).

koute · 2021-11-09T06:56:46Z

Added a variant of the benchmark with proof recording; the difference in performance is minimal:

Block production/4673 transfers (no proof)
                        time:   [1.3151 s 1.3181 s 1.3210 s]
                        thrpt:  [3.5373 Kelem/s 3.5452 Kelem/s 3.5533 Kelem/s]

Block production/4673 transfers (with proof)
                        time:   [1.3597 s 1.3619 s 1.3641 s]
                        thrpt:  [3.4257 Kelem/s 3.4312 Kelem/s 3.4367 Kelem/s]
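
To put a number on "minimal": comparing the two median times above gives the relative cost of proof recording in this run. A small sketch of the arithmetic; `overhead_percent` is a hypothetical helper:

```rust
/// Relative slowdown of `with` compared to `without`, in percent.
fn overhead_percent(without: f64, with: f64) -> f64 {
    (with / without - 1.0) * 100.0
}

fn main() {
    // Median times in seconds from the two runs above.
    let overhead = overhead_percent(1.3181, 1.3619);
    println!("proof recording overhead: {:.1}%", overhead);
}
```

That works out to a bit over 3% on these medians.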

gilescope · 2021-11-09T09:06:36Z

bin/node/cli/benches/block_production.rs

+		informant_output_format: Default::default(),
+		wasm_runtime_overrides: None,
+	};
+


I feel like Configuration should really have a Default impl. Then you can write Configuration { impl_name: "BenchmarkImpl".into(), ..Default::default() }.
I feel like we don't care about most of these details here.
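
A sketch of what this suggestion would look like, using Rust's struct update syntax. The `Configuration` type below is a hypothetical, heavily trimmed stand-in; only `impl_name` and `wasm_runtime_overrides` are field names taken from this thread, and the real struct has no such `Default` impl per the comment above:

```rust
/// Hypothetical, heavily trimmed stand-in for the node configuration struct.
#[derive(Debug, Default)]
struct Configuration {
    impl_name: String,
    wasm_runtime_overrides: Option<String>,
    // Hypothetical extra field, only here to show more defaults being filled in.
    max_runtime_instances: usize,
}

fn benchmark_config() -> Configuration {
    Configuration {
        impl_name: "BenchmarkImpl".into(),
        // Every field not listed above takes its `Default` value.
        ..Default::default()
    }
}

fn main() {
    println!("{:?}", benchmark_config());
}
```

The benchmark would then spell out only the fields it actually cares about.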

bkchr · 2021-11-09T12:08:21Z

> Added a variant of the benchmark with proof recording; the difference in performance is minimal:
>
> Block production/4673 transfers (no proof)
>                         time:   [1.3151 s 1.3181 s 1.3210 s]
>                         thrpt:  [3.5373 Kelem/s 3.5452 Kelem/s 3.5533 Kelem/s]
>
> Block production/4673 transfers (with proof)
>                         time:   [1.3597 s 1.3619 s 1.3641 s]
>                         thrpt:  [3.4257 Kelem/s 3.4312 Kelem/s 3.4367 Kelem/s]

Ahh yeah, we always read the same 2 positions, so this is cached :P But let us merge this, and then in a follow-up we should try to use unique accounts for sending and receiving, for every tx.
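
A toy model of why the same sender and receiver make proof recording cheap. It only counts distinct account entries touched by a batch of transfers; real block execution touches other storage as well, and `distinct_accounts_touched` and the integer account IDs are hypothetical:

```rust
use std::collections::HashSet;

/// Hypothetical model: each transfer touches the storage entries of its
/// sender and receiver. Returns how many distinct entries a batch touches.
fn distinct_accounts_touched(transfers: &[(u32, u32)]) -> usize {
    let mut touched = HashSet::new();
    for &(from, to) in transfers {
        touched.insert(from);
        touched.insert(to);
    }
    touched.len()
}

fn main() {
    let n = 4673u32;
    // Current benchmark: the same sender and receiver for every transfer.
    let same_pair: Vec<(u32, u32)> = (0..n).map(|_| (0, 1)).collect();
    // Proposed follow-up: a unique sender and receiver per transfer.
    let unique: Vec<(u32, u32)> = (0..n).map(|i| (2 * i, 2 * i + 1)).collect();

    println!("same pair: {}", distinct_accounts_touched(&same_pair));
    println!("unique:    {}", distinct_accounts_touched(&unique));
}
```

In this model the current benchmark touches two entries regardless of the transfer count, while unique accounts would touch two per transfer, so a recorded proof would have far more to cover.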

koute · 2021-11-09T13:38:11Z

Sounds good to me; I'll add the unique accounts in a separate PR.

koute · 2021-11-09T13:38:23Z

bot merge

* Add a block production benchmark
* Simplify the block production benchmark
* Cleanups; switch execution strategy to WASM
* Switch WASM execution to `Compiled`
* Reduce the setup cost of the benchmark

  Creating all of those extrinsics takes up *a lot* of time, up to the point where the majority of the time is actually spent *outside* of the code which we want to benchmark here. So let's only do it once.

* Add a variant of the block production benchmark with proof recording

Add a block production benchmark

eca3243

koute requested review from gilescope and bkchr October 27, 2021 06:35

bkchr reviewed Oct 29, 2021

shawntabrizi reviewed Oct 31, 2021

Simplify the block production benchmark

92bf63c

bkchr approved these changes Nov 1, 2021

Cleanups; switch execution strategy to WASM

1c0c499

Switch WASM execution to Compiled

66c44b7

Reduce the setup cost of the benchmark

da0b972

Creating all of those extrinsics takes up *a lot* of time, up to the point where the majority of the time is actually spent *outside* of the code which we want to benchmark here. So let's only do it once.

Merge branch 'master' into master_block_production_benchmark

a0bf232

Add a variant of the block production benchmark with proof recording

4d16a09

gilescope reviewed Nov 9, 2021

gilescope approved these changes Nov 9, 2021

paritytech-processbot bot merged commit 800fac1 into paritytech:master Nov 9, 2021

github-actions bot mentioned this pull request Dec 2, 2021

Update substrate/polkadot/cumulus from v0.9.12 to v0.9.13 moonbeam-foundation/moonbeam#1050

Closed

ghzlatarev mentioned this pull request Dec 6, 2021

[Manta] Integrate v0.9.13 upstream changes Manta-Network/Manta#294

Merged
