Compiler Performance: Benchmark Definitions #48750
Comments
On IRC, @nagisa suggested adding crates with large amounts of generated code to the FROM-SCRATCH benchmark, in particular
I'll add them to FROM-SCRATCH and DIST for now.
Interesting choice. I would have expected the mean/median/geometric-mean, something like that, but I guess I can see the appeal of the sum. I wonder if we want to make multiple statistics available.
I think the median would lose too much information. The arithmetic mean is pretty similar to the sum. The geometric mean sounds like an interesting choice. It would give each sub-benchmark exactly the same weight. I'm not sure if that's desirable.
What do other benchmark suites do?
SpecInt uses the geometric mean for combining scores. JetStream too. Sunspider and Kraken just use the sum of all execution times (at least on https://arewefastyet.com). Octane uses the geometric mean too. An additional thought would be to compute the score as
OK, it seems like the question is how much we want the "absolute times" to matter. That is, if our suite includes one thing that takes a long time (say, 1 minute) and a bunch that are very short (sub-second), then effectively improving those won't matter at all. This seems good and bad. To that end, I wonder if we can just present both. If I had to pick one, though, I'd probably pick geometric mean, since otherwise it seems like smaller tests will just get drowned out and might as well be excluded. I found this text from Wikipedia helpful to understand geometric mean:
Geometric mean is a good default for durations (along with geometric standard deviation, for understanding uncertainty in multiple runs of building the same thing). Arithmetic mean is usually a poor choice for one-sided distributions (like time duration) since the usual μ±2σ so easily goes negative for high-variance measurements, making the usual intuitions less helpful. On durations specifically, psychology has shown that the perceived midpoint between two durations is at the geometric mean, not the arithmetic mean, in both humans and animals. (I learned this from Designing and Engineering Time (Seow 2008), which references Human bisection at the geometric mean (Allan 1991) and Bisection of temporal intervals (Church and Deluty 1977), though sadly I don't have a free source.)
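To make the trade-off concrete, here is a small Rust sketch (not from the original thread; the function names and sample build times are purely illustrative) that compares the sum, the geometric mean, and the geometric standard deviation over a set of per-project build times:

```rust
/// Geometric mean of positive build times (in seconds):
/// the exponential of the arithmetic mean of the logarithms.
fn geometric_mean(times: &[f64]) -> f64 {
    let mean_log = times.iter().map(|t| t.ln()).sum::<f64>() / times.len() as f64;
    mean_log.exp()
}

/// Geometric standard deviation: the exponential of the standard deviation
/// of the logarithms. A value of 1.0 means no spread at all.
fn geometric_std_dev(times: &[f64]) -> f64 {
    let n = times.len() as f64;
    let mean_log = times.iter().map(|t| t.ln()).sum::<f64>() / n;
    let var_log = times.iter().map(|t| (t.ln() - mean_log).powi(2)).sum::<f64>() / n;
    var_log.sqrt().exp()
}

fn main() {
    // Hypothetical build times for one scenario: two small projects, one large one.
    let build_times = [0.4, 2.3, 61.0];
    println!("sum:            {:.2} s", build_times.iter().sum::<f64>());
    println!("geometric mean: {:.2} s", geometric_mean(&build_times));
    println!("geometric SD:   {:.2}", geometric_std_dev(&build_times));
}
```

With numbers like these, the sum is dominated almost entirely by the 61-second project, while the geometric mean gives every project the same weight regardless of its absolute build time, which is exactly the trade-off discussed above.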
compare.py doesn't try to produce an aggregate score, it just gives speedups per program, viz:
I found that worked well.
@scottmcm, that's very useful, thanks!
@nnethercote Yes, that is very useful for gauging the effect of individual optimizations and more detailed investigations. We support a view like this on perf.rlo already (example) and we definitely want to keep and extend that, since it's probably what one is going to use most for day-to-day work. I still think an additional aggregate score per usage scenario and a dashboard showing it over time is useful for seeing long-term trends.
After thinking some more and re-reading the performance section from Hennessy & Patterson, I think the right thing to do for a summary score is the following.
However, there is a bigger problem that I think needs to be solved first: the benchmark suite is not consistent. New benchmarks get added, old ones get removed, and sometimes benchmarks temporarily fail to compile due to compiler bugs (the compiler needs fixing) or due to the use of unstable code (the benchmarks need fixing). I tweaked the site's code to simply plot the number of benchmarks that ran successfully. Here's the output for a period of time over May and June this year:

I think it's critical that we find a way to effectively fill in these gaps. I suggest using interpolation to produce "fake" (but reasonable) values. For example, imagine we have 20 benchmarks, and we run each of them in a particular rustc configuration (e.g. "Check" + "Clean") on 10 different rustc versions. We want 10 different "summary" values, one per rustc version. Imagine also that one of the benchmarks, B, failed to compile on runs 1, 5, 9, and 10, due to bugs in those particular rustc versions. Without interpolation, the summary values for those runs will not match the summary values for all the other runs. Interpolation would work in the following way.
This will fill in most of the gaps, getting us much closer to a complete set of data. It won't help in the case where a benchmark B failed to compile on every run (as happens in some cases with NLL), but it still gets us most of the way there. @Mark-Simulacrum: I looked at the code yesterday to see how to do this. Unfortunately the data structures used didn't seem all that amenable to working out what the missing data points were and adding them in, but maybe you have some ideas.
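The interpolation scheme itself isn't spelled out in the thread, so the following is only a rough sketch of the idea, assuming simple linear interpolation between the nearest successful runs of the same benchmark and copying the nearest measured value at the ends; the real rustc-perf data structures look quite different:

```rust
/// Fill gaps in one benchmark's per-run results (None = failed to compile) by
/// linearly interpolating between the nearest successful neighbouring runs,
/// and copying the nearest measured value for leading/trailing gaps.
fn fill_gaps(results: &[Option<f64>]) -> Vec<Option<f64>> {
    // Indices and values of the runs that actually succeeded.
    let known: Vec<(usize, f64)> = results
        .iter()
        .enumerate()
        .filter_map(|(i, v)| v.map(|v| (i, v)))
        .collect();
    if known.is_empty() {
        // A benchmark that never compiled cannot be interpolated at all.
        return results.to_vec();
    }
    results
        .iter()
        .enumerate()
        .map(|(i, v)| {
            if v.is_some() {
                return *v;
            }
            let before = known.iter().rev().find(|(k, _)| *k < i);
            let after = known.iter().find(|(k, _)| *k > i);
            Some(match (before, after) {
                (Some(&(i0, v0)), Some(&(i1, v1))) => {
                    // Interior gap: interpolate between the surrounding runs.
                    v0 + (v1 - v0) * (i - i0) as f64 / (i1 - i0) as f64
                }
                (Some(&(_, v0)), None) => v0, // trailing gap: repeat last value
                (None, Some(&(_, v1))) => v1, // leading gap: repeat first value
                (None, None) => unreachable!("known is non-empty"),
            })
        })
        .collect()
}

fn main() {
    // Benchmark B from the example above: failed on runs 1, 5, 9, and 10.
    let b = [None, Some(10.0), Some(11.0), Some(10.5), None,
             Some(12.0), Some(12.5), Some(12.0), None, None];
    println!("{:?}", fill_gaps(&b));
}
```

Applied to benchmark B from the example, this fills the gap at run 5 by interpolation and the gaps at runs 1, 9, and 10 by repeating the nearest measured value; a benchmark that never compiled on any run is left alone, matching the NLL caveat above.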
May I suggest adding the psl crate to your benchmarks? It generates a huge
Visiting for backlog bonanza. It is not clear if there's any information here that isn't already in https://github.com/rust-lang/rustc-perf. @Mark-Simulacrum @rylev @nnethercote I'm just mentioning you in case you happen to see something in this issue's description that seems worth pulling over to the README or other pages for https://github.com/rust-lang/rustc-perf.
I agree that closing this issue is reasonable. A lot of changes have been made to rustc-perf in the past four years, and this issue is no longer serving a useful purpose.
The compiler performance tracking issue (#48547) defines the four main usage scenarios that we strive to support well. In order to measure how well we are actually doing, we define a benchmark for each scenario. Eventually, perf.rust-lang.org will provide a graph for each scenario that shows how compile times develop over time.
Methodology
Compiler performance in each scenario is measured by the sum of all build times for a given set of projects. The build settings depend on the usage scenario. The set of projects should contain small, medium, and large ones.
Benchmarks
FROM-SCRATCH - Compiling a project from scratch
Compile the listed projects with each of the following combinations:
Projects:
SMALL-CHANGE - Re-compiling a project after a small change
For this scenario, we re-compile the project incrementally with a full cache after a `println!()` statement has been added somewhere:
- `cargo test --lib --no-run` (…)
- `cargo test --no-run` (…)
- `cargo test --lib --no-run` (…)
- `cargo test --no-run` (…)
- `cargo test --test=all --no-run` (…)

RLS - Continuously re-compiling a project for the Rust Language Server
For this scenario, we run `cargo check` incrementally with a full cache after a `println!()` statement has been added somewhere:
- `cargo check`, non-optimized & incremental (w/ full cache)

Projects:
NOTE: This is a rather crude method for measuring RLS performance since there are many more variables that need to be taken into account here. For example, the RLS will invoke the compiler differently, allowing for things to be kept in memory that would go onto the disk otherwise. It also produces "save-analysis" data, which `cargo check` does not, and the creation of which can take up a significant amount of time and thus should be measured! Consequently, the RLS benchmarks need more discussion.
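For what it's worth, here is a minimal sketch of how a single SMALL-CHANGE or RLS measurement could be scripted; the harness, project path, and touched file are assumptions for illustration and not part of the proposal (`cargo check` matches the RLS list above, and the `cargo test ... --no-run` variants from SMALL-CHANGE could be substituted):

```rust
use std::fs::OpenOptions;
use std::io::Write;
use std::process::Command;
use std::time::Instant;

/// Time one incremental re-check of a project after a trivial edit.
/// `project_dir` and `touched_file` are placeholders, not real benchmark names.
fn time_incremental_check(project_dir: &str, touched_file: &str) -> std::io::Result<f64> {
    // 1. Prime the incremental cache with a full check.
    let status = Command::new("cargo")
        .arg("check")
        .current_dir(project_dir)
        .env("CARGO_INCREMENTAL", "1")
        .status()?;
    assert!(status.success());

    // 2. Simulate the "small change": append a println!() somewhere.
    let path = format!("{}/{}", project_dir, touched_file);
    let mut file = OpenOptions::new().append(true).open(path)?;
    writeln!(file, "fn __bench_touch() {{ println!(\"touched\"); }}")?;

    // 3. Time the incremental re-check against the now-full cache.
    let start = Instant::now();
    let status = Command::new("cargo")
        .arg("check")
        .current_dir(project_dir)
        .env("CARGO_INCREMENTAL", "1")
        .status()?;
    assert!(status.success());
    Ok(start.elapsed().as_secs_f64())
}

fn main() -> std::io::Result<()> {
    // Hypothetical project path; any of the benchmark projects would do.
    let secs = time_incremental_check("benchmarks/some-project", "src/lib.rs")?;
    println!("incremental cargo check: {:.2} s", secs);
    Ok(())
}
```

A real harness would also need to revert the edit between runs, repeat the measurement several times, and record the individual timings; this only shows the basic shape of one measurement.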
DIST - Compiling a project for maximum runtime performance
For this scenario, we compile the projects from scratch, with maximum optimizations:
Projects:
Open Questions
Please provide your feedback on how well you think the above benchmarks actually measure what people care about when using the Rust compiler. I expect these definitions to undergo a few cycles of iteration before we are satisfied with them.
cc @rust-lang/wg-compiler-performance