
subsystem-bench: cache misses profiling #2893

Merged
merged 28 commits into master from AndreiEres/bench-cache-misses on Jan 16, 2024

Conversation

@AndreiEres (Contributor) commented Jan 9, 2024

Why we need it

To provide another level of understanding of why Polkadot's subsystems may perform slower than expected. Cache misses occur when processing large amounts of data, such as during availability recovery.

Why Cachegrind

Cachegrind has significant drawbacks: it is slow, and it uses its own cache simulation, which is very basic. But unlike perf, which is otherwise a great tool, Cachegrind can run inside a virtual machine. This means we can easily run it on remote installations and even use it in CI/CD to catch possible regressions.

Why Cachegrind and not Callgrind, another part of Valgrind? Simply because, empirically, profiling runs faster with Cachegrind.

First results

The first results were obtained while testing the approach. Here is an example.

$ target/testnet/subsystem-bench --n-cores 10 --cache-misses data-availability-read
$ cat cachegrind_report.txt
I refs:        64,622,081,485
I1  misses:         3,018,168
LLi misses:           437,654
I1  miss rate:           0.00%
LLi miss rate:           0.00%

D refs:        12,161,833,115  (9,868,356,364 rd   + 2,293,476,751 wr)
D1  misses:       167,940,701  (   71,060,073 rd   +    96,880,628 wr)
LLd misses:        33,550,018  (   16,685,853 rd   +    16,864,165 wr)
D1  miss rate:            1.4% (          0.7%     +           4.2%  )
LLd miss rate:            0.3% (          0.2%     +           0.7%  )

LL refs:          170,958,869  (   74,078,241 rd   +    96,880,628 wr)
LL misses:         33,987,672  (   17,123,507 rd   +    16,864,165 wr)
LL miss rate:             0.0% (          0.0%     +           0.7%  )

The CLI output shows that the L1 data cache missed 1.4% of the time, which is not so bad given that the last-level cache missed on that data only 0.3% of the time. The L1 instruction cache shows a 0.00% miss rate. Annotating the output file with cg_annotate shows that most of the misses occur during Reed-Solomon coding, which is expected.

@AndreiEres changed the title from "[WIP] subsystem-bench: cache misses" to "subsystem-bench: cache misses profiling" Jan 10, 2024
@AndreiEres added the T10-tests and T12-benchmarks labels Jan 10, 2024
@AndreiEres marked this pull request as ready for review January 10, 2024 11:42

@alindima (Contributor) left a comment:

Generally looking good to me.

I think we should avoid printing cachegrind output directly to stdout, as it can be confusing. Either print to a file or prepend the valgrind stdout with a header that specifies that valgrind output follows.

@@ -198,6 +216,52 @@ impl BenchCli {
}
}

#[cfg(target_os = "linux")]
fn is_valgrind_mode() -> bool {

@alindima (Contributor):
Nit: we could add all of these functions to a Linux-only valgrind module for better encapsulation. Also, we could avoid having empty valgrind functions.

@AndreiEres (Contributor, Author):
Yes, it's a good idea to extract this into a module, but how do we avoid the empty functions?

@alindima (Contributor):
If you add #![cfg(target_os = "linux")] to the top of the valgrind file, it'll only be compiled on Linux. Then you'd have to call the valgrind functions only on Linux (add #[cfg()]s to the calling code), and you wouldn't need empty functions.
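
A minimal sketch of that layout, assuming a dedicated valgrind.rs module (the module name, function name, and env-var detection scheme are illustrative assumptions, not necessarily the PR's final code):

// valgrind.rs: the inner attribute makes the whole file Linux-only.
#![cfg(target_os = "linux")]

pub fn is_valgrind_mode() -> bool {
    // One possible detection scheme (an assumption): an environment
    // variable set before the benchmark re-execs itself under valgrind.
    std::env::var("SUBSYSTEM_BENCH_VALGRIND").is_ok()
}

// Calling code: the call site is gated, so no empty stub is needed
// on non-Linux targets.
#[cfg(target_os = "linux")]
if valgrind::is_valgrind_mode() {
    // ... collect cache statistics ...
}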

polkadot/node/subsystem-bench/src/subsystem-bench.rs (outdated review thread, resolved)
@sandreim (Contributor) left a comment:

LGTM! There are additional cache-sim options which might be useful:

--I1=<size>,<associativity>,<line size>
Specify the size, associativity and line size of the level 1 instruction cache. Only useful with --cache-sim=yes.

--D1=<size>,<associativity>,<line size>
Specify the size, associativity and line size of the level 1 data cache. Only useful with --cache-sim=yes.

--LL=<size>,<associativity>,<line size>
Specify the size, associativity and line size of the last-level cache. Only useful with --cache-sim=yes.

The documentation states that the simulator currently approximates an AMD Athlon CPU circa 2002, which is worse than the reference hardware spec. I think we should tune these values to the reference hardware or the actual host configuration.
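
For illustration, a sketch of what tuned parameters could look like when spawning cachegrind from Rust. The geometry below roughly matches an Intel Ice Lake core (32 KiB/8-way L1i, 48 KiB/12-way L1d); the last-level size is a placeholder, and none of these values are necessarily what the PR ended up using:

// Assumed values, illustrative only.
let mut cmd = std::process::Command::new("valgrind");
cmd.arg("--tool=cachegrind")
    .arg("--cache-sim=yes")
    .arg("--I1=32768,8,64") // 32 KiB L1 instruction cache, 8-way, 64 B lines
    .arg("--D1=49152,12,64") // 48 KiB L1 data cache, 12-way, 64 B lines
    .arg("--LL=2097152,16,64"); // 2 MiB last-level cache, 16-way, 64 B lines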

#[cfg(target_os = "linux")]
fn valgrind_init() -> eyre::Result<()> {
use std::os::unix::process::CommandExt;
std::process::Command::new("valgrind")

Reviewer (Contributor):

It doesn't look like we get an error printed if valgrind is missing.

@AndreiEres (Contributor, Author):

Good catch!
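
One way to surface the failure, sketched under the assumption that the benchmark re-execs itself under valgrind (the argument list and error wording are illustrative): exec() only returns on failure, so the returned io::Error can be inspected.

#[cfg(target_os = "linux")]
fn valgrind_init() -> eyre::Result<()> {
    use std::os::unix::process::CommandExt;
    // exec() replaces the current process on success, so reaching the
    // code below means spawning valgrind failed.
    let err = std::process::Command::new("valgrind")
        .args(["--tool=cachegrind", "--cache-sim=yes"]) // illustrative flags
        .args(std::env::args()) // re-run this binary under valgrind
        .exec();
    if err.kind() == std::io::ErrorKind::NotFound {
        return Err(eyre::eyre!("valgrind not found, please install it first"));
    }
    Err(eyre::eyre!("failed to launch valgrind: {}", err))
}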

@AndreiEres (Contributor, Author):

I think we should avoid printing cachegrind output directly to stdout, as it can be confusing.

That's a good idea. Unfortunately, I couldn't find a way to capture the report from stderr, because it only appears after the process has completed. So I print it to a report file, which is a good option IMHO.

@alindima (Contributor):

That's a good idea. Unfortunately, I couldn't find a way to capture the report from stderr, because it only appears after the process has completed. So I print it to a report file, which is a good option IMHO.

I think you could use https://doc.rust-lang.org/std/process/struct.Command.html#method.output for this (which enables you to get stderr as well). But printing to a file is good as well IMO 👍🏻
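
A small sketch of that suggestion, assuming the benchmark is re-run as a child process rather than exec'd (the binary path and flags are placeholders): Command::output() blocks until the child exits and captures its stderr, which is where cachegrind prints the summary.

// Illustrative only: path and flags are placeholders.
let output = std::process::Command::new("valgrind")
    .args(["--tool=cachegrind", "--cache-sim=yes"])
    .arg("target/testnet/subsystem-bench")
    .output()?; // waits for the child, capturing stdout and stderr
std::fs::write("cachegrind_report.txt", &output.stderr)?;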

@AndreiEres (Contributor, Author):

I think we should tune these values to the ref hardware or the actual host configuration.

@sandreim I tuned the simulation config to an Intel Ice Lake CPU.

@AndreiEres added the R0-silent label Jan 15, 2024
@AndreiEres enabled auto-merge January 16, 2024 16:31
@AndreiEres added this pull request to the merge queue Jan 16, 2024
Merged via the queue into master with commit ec7bfae Jan 16, 2024
123 of 124 checks passed
@AndreiEres deleted the AndreiEres/bench-cache-misses branch January 16, 2024 17:57
bkchr pushed a commit that referenced this pull request Apr 10, 2024
Labels
R0-silent: Changes should not be mentioned in any release notes.
T10-tests: This PR/Issue is related to tests.
T12-benchmarks: This PR/Issue is related to benchmarking and weights.