
subsystem-bench: cache misses profiling #2893

Merged
merged 28 commits into master from AndreiEres/bench-cache-misses on Jan 16, 2024

Conversation

@AndreiEres (Contributor) commented Jan 9, 2024

Why we need it

To provide another level of understanding of why Polkadot's subsystems may perform slower than expected. Cache misses occur when processing large amounts of data, such as during availability recovery.

Why Cachegrind

Cachegrind has significant drawbacks: it is slow, and it uses its own cache simulation, which is very basic. But unlike perf, which is otherwise a great tool, Cachegrind can run inside a virtual machine. This means we can easily run it on remote installations and even use it in CI/CD to catch possible regressions.

Why Cachegrind and not Callgrind, another part of Valgrind? Simply because, empirically, profiling runs faster with Cachegrind.

First results

The first results were obtained while testing the approach. Here is an example.

$ target/testnet/subsystem-bench --n-cores 10 --cache-misses data-availability-read
$ cat cachegrind_report.txt
I refs:        64,622,081,485
I1  misses:         3,018,168
LLi misses:           437,654
I1  miss rate:           0.00%
LLi miss rate:           0.00%

D refs:        12,161,833,115  (9,868,356,364 rd   + 2,293,476,751 wr)
D1  misses:       167,940,701  (   71,060,073 rd   +    96,880,628 wr)
LLd misses:        33,550,018  (   16,685,853 rd   +    16,864,165 wr)
D1  miss rate:            1.4% (          0.7%     +           4.2%  )
LLd miss rate:            0.3% (          0.2%     +           0.7%  )

LL refs:          170,958,869  (   74,078,241 rd   +    96,880,628 wr)
LL misses:         33,987,672  (   17,123,507 rd   +    16,864,165 wr)
LL miss rate:             0.0% (          0.0%     +           0.7%  )

The CLI output shows that the L1 data cache missed 1.4% of the time, which is not so bad given that the last-level cache missed on that data only 0.3% of the time. The L1 instruction cache shows a 0.00% miss rate. Annotating the output file with cg_annotate shows that most of the misses occur during Reed-Solomon coding, which is expected.

@AndreiEres changed the title from "[WIP] subsystem-bench: cache misses" to "subsystem-bench: cache misses profiling" Jan 10, 2024
@AndreiEres added the T10-tests and T12-benchmarks labels Jan 10, 2024
@AndreiEres marked this pull request as ready for review January 10, 2024 11:42

@alindima (Contributor) left a comment:

Generally looking good to me.

I think we should avoid printing cachegrind output directly to stdout, as it can be confusing. Either print to a file or prepend the valgrind stdout with a header that specifies that valgrind output follows.

@@ -198,6 +216,52 @@ impl BenchCli {
}
}

#[cfg(target_os = "linux")]
fn is_valgrind_mode() -> bool {

@alindima (Contributor):
Nit: we could add all of these functions to a Linux-only valgrind module for better encapsulation. Also, we could avoid having empty valgrind functions.

@AndreiEres (Contributor, Author):
Yes, it's a good idea to extract this into a module, but how do we avoid the empty functions?

@alindima (Contributor):
If you add #![cfg(target_os = "linux")] to the top of the valgrind file, it'll only be compiled on Linux. Then you'd have to call the valgrind functions only on Linux (add #[cfg()]s to the calling code), and you wouldn't need empty functions.
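
A minimal sketch of that layout, assuming a dedicated valgrind.rs module (the module name, function name, and env-var detection scheme are illustrative assumptions, not necessarily the PR's final code):

// valgrind.rs: the inner attribute makes the whole file Linux-only.
#![cfg(target_os = "linux")]

pub fn is_valgrind_mode() -> bool {
    // One possible detection scheme (an assumption): an environment
    // variable set before the benchmark re-execs itself under valgrind.
    std::env::var("SUBSYSTEM_BENCH_VALGRIND").is_ok()
}

// Calling code: the call site is gated, so no empty stub is needed
// on non-Linux targets.
#[cfg(target_os = "linux")]
if valgrind::is_valgrind_mode() {
    // ... collect cache statistics ...
}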

polkadot/node/subsystem-bench/src/subsystem-bench.rs (outdated review thread, resolved)
@sandreim (Contributor) left a comment:

LGTM! There are additional cache-sim options which might be useful:

--I1=<size>,<associativity>,<line size>
Specify the size, associativity and line size of the level 1 instruction cache. Only useful with --cache-sim=yes.

--D1=<size>,<associativity>,<line size>
Specify the size, associativity and line size of the level 1 data cache. Only useful with --cache-sim=yes.

--LL=<size>,<associativity>,<line size>
Specify the size, associativity and line size of the last-level cache. Only useful with --cache-sim=yes.

The documentation states that the simulator currently approximates an AMD Athlon CPU circa 2002, which is worse than the reference hardware spec. I think we should tune these values to the reference hardware or the actual host configuration.
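
For illustration, a sketch of what tuned parameters could look like when spawning cachegrind from Rust. The geometry below roughly matches an Intel Ice Lake core (32 KiB/8-way L1i, 48 KiB/12-way L1d); the last-level size is a placeholder, and none of these values are necessarily what the PR ended up using:

// Assumed values, illustrative only.
let mut cmd = std::process::Command::new("valgrind");
cmd.arg("--tool=cachegrind")
    .arg("--cache-sim=yes")
    .arg("--I1=32768,8,64") // 32 KiB L1 instruction cache, 8-way, 64 B lines
    .arg("--D1=49152,12,64") // 48 KiB L1 data cache, 12-way, 64 B lines
    .arg("--LL=2097152,16,64"); // 2 MiB last-level cache, 16-way, 64 B lines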

#[cfg(target_os = "linux")]
fn valgrind_init() -> eyre::Result<()> {
use std::os::unix::process::CommandExt;
std::process::Command::new("valgrind")

Reviewer (Contributor):

It doesn't look like we get an error printed if valgrind is missing.

@AndreiEres (Contributor, Author):

Good catch!
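
One way to surface the failure, sketched under the assumption that the benchmark re-execs itself under valgrind (the argument list and error wording are illustrative): exec() only returns on failure, so the returned io::Error can be inspected.

#[cfg(target_os = "linux")]
fn valgrind_init() -> eyre::Result<()> {
    use std::os::unix::process::CommandExt;
    // exec() replaces the current process on success, so reaching the
    // code below means spawning valgrind failed.
    let err = std::process::Command::new("valgrind")
        .args(["--tool=cachegrind", "--cache-sim=yes"]) // illustrative flags
        .args(std::env::args()) // re-run this binary under valgrind
        .exec();
    if err.kind() == std::io::ErrorKind::NotFound {
        return Err(eyre::eyre!("valgrind not found, please install it first"));
    }
    Err(eyre::eyre!("failed to launch valgrind: {}", err))
}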

@AndreiEres (Contributor, Author):

I think we should avoid printing cachegrind output directly to stdout, as it can be confusing.

That's a good idea. Unfortunately, I couldn't find a way to capture the report from stderr, because it only appears after the process has completed. So I print it to a report file, which is a good option IMHO.

@alindima (Contributor):

That's a good idea. Unfortunately, I couldn't find a way to capture the report from stderr, because it only appears after the process has completed. So I print it to a report file, which is a good option IMHO.

I think you could use https://doc.rust-lang.org/std/process/struct.Command.html#method.output for this (which enables you to get stderr as well). But printing to a file is good as well IMO 👍🏻
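
A small sketch of that suggestion, assuming the benchmark is re-run as a child process rather than exec'd (the binary path and flags are placeholders): Command::output() blocks until the child exits and captures its stderr, which is where cachegrind prints the summary.

// Illustrative only: path and flags are placeholders.
let output = std::process::Command::new("valgrind")
    .args(["--tool=cachegrind", "--cache-sim=yes"])
    .arg("target/testnet/subsystem-bench")
    .output()?; // waits for the child, capturing stdout and stderr
std::fs::write("cachegrind_report.txt", &output.stderr)?;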

@AndreiEres (Contributor, Author):

I think we should tune these values to the ref hardware or the actual host configuration.

@sandreim I tuned the simulation config to an Intel Ice Lake CPU.

@AndreiEres added the R0-silent label Jan 15, 2024
@AndreiEres enabled auto-merge January 16, 2024 16:31
@AndreiEres added this pull request to the merge queue Jan 16, 2024
Merged via the queue into master with commit ec7bfae Jan 16, 2024
123 of 124 checks passed
@AndreiEres deleted the AndreiEres/bench-cache-misses branch January 16, 2024 17:57
bkchr pushed a commit that referenced this pull request Apr 10, 2024
Labels
R0-silent: Changes should not be mentioned in any release notes.
T10-tests: This PR/Issue is related to tests.
T12-benchmarks: This PR/Issue is related to benchmarking and weights.