Create initial version of a "dashboard" that shows long-term performance trends #192
Comments
This sounds great!
I think it's time to at least consider this -- we're currently at about 1h40m, which is quite long, though within the necessary limits for us to keep up with CI. However, since we are planning on adding more benchmarks, we should look into either faster hardware or running the benchmarks in parallel across multiple machines. I'm not sure what resources we have for dedicated hardware (since we're measuring timing information we probably want to avoid the cloud).
I agree that running the benchmarks in parallel is the better option. That CPU is already pretty fast; I doubt that you can get something that is more than 50% faster in single-threaded scenarios (of which we still have quite a few). If money were not an issue, I'd probably rent a couple of dedicated servers somewhere. That would be easiest to administer, I guess. I'll ask around a bit. Would implementing multi-machine execution be a lot of effort?
Somewhat, but we'll need to do most of the work anyway for proper try build scheduling from the frontend.
I wrote a script to gather data on several of the current benchmarks across a bunch of releases. Mock-up dashboard is here. I'll be updating this soon to include the "add println!" case.
I see https://perf.rust-lang.org/dashboard.html is now up, thanks to #238. I think having a per-version compile time tracker is a very good thing. But I also have some major concerns with the current implementation, mostly around the fact that it's very unclear to me what is being measured -- in particular the y-axis, the terms used, and the measurements themselves.
Can you say why? I believe that @scottmcm proposed a log scale to make it easier to observe large percentage changes even when the absolute measure is low. I definitely found the graph more informative this way than on a linear scale.
Why?
Personally, I'm very interested in getting a dashboard view of not just relative improvements, but also of the actual user experience as conveyed by absolute times. (We've also discussed some possible targets in terms of absolute times for various workflows.) Now, that said, there are multiple useful views into the data oriented around absolute times; on the spreadsheet version, I found it interesting to break things down by percentile, to get a sense for "typical" vs "outlier" experience, but the set of benchmarks is small enough that this probably isn't very meaningful.
Yes, as I mentioned on IRC, the subset of benchmarks consists of the ones that succeed on all versions.
Log scales are visually misleading. In this case, compile time increases are visually diminished and compile time decreases are visually exaggerated. Log scales are a reasonable choice when your values cover multiple orders of magnitude, but currently the difference between the minimum and maximum values is less than 5x, which isn't much at all. (And the graphs could easily be made taller.)

Starting the y-axis above zero is also visually misleading. E.g. in the top graph a reduction in time from 2s to 1s would look like an enormous relative improvement.

Averages can be dangerous, especially arithmetic averages, especially if the measurement sets aren't consistent. E.g. if script-servo fails to compile with a particular version, the results will suddenly be much better, because that large benchmark's times no longer drag up the average.
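To put rough numbers on that last point, here is a minimal Rust sketch (with invented build times, not real benchmark data) of how the arithmetic mean jumps when one large benchmark drops out of the set, even though nothing actually got faster:

```rust
fn arithmetic_mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

fn main() {
    // Invented build times (seconds): four small crates plus one script-servo-sized one.
    let all = [1.0, 2.0, 3.0, 4.0, 90.0];
    let without_big = [1.0, 2.0, 3.0, 4.0];

    // Nothing got faster, yet the average drops from 20.0 to 2.5
    // simply because the large benchmark failed to compile.
    println!("{}", arithmetic_mean(&all)); // 20
    println!("{}", arithmetic_mean(&without_big)); // 2.5
}
```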
Note that the geometric mean is just the arithmetic mean on a log scale, so the geometric mean being the right choice is a strong indicator that a log scale is also the right choice. Also, the just-noticeable difference for a quantity is generally relative, not absolute, so a log scale is the best one to answer "has this improved (or regressed) enough that people would notice?" Seow (2008) says that this holds for distinguishing between two durations, and suggests a threshold of 20% -- which is easiest to see on a log scale, where it's ±0.2 (if using natural log).
I know what geometric mean is, and arithmetic mean, and a log scale, but I don't understand that sentence at all. Can you explain more or give a link?
I am interested in changes as small as 1%. Such changes are easy to spot if the graph is tall enough and the data points are close to each other, especially when the points are labelled with the actual numeric values. To repeat my arguments against a log scale:
When I first saw the y-axis I honestly thought there was a bug in the graphing code. After some staring, I wondered if it might be a log scale. (Every other log scale I've ever seen has been obvious, with markers like 1, 10, 100, 1000, etc.) I had to ask on IRC to confirm. It wasn't a good experience.
In my view this dashboard is meant to show a general long term trend of compiler performance. The number it shows is an abstract "score" and not meant for daily use when optimizing the compiler, similar to SpecInt or Octane. Both of these use the geometric mean for combining their various sub-benchmarks into one number (see rust-lang/rust#48750 (comment)). @scottmcm also provided some research reference here: rust-lang/rust#48750 (comment). One potential problem with the arithmetic mean is that benchmarks for big crates (like style-servo or script-servo) make improvements in smaller crates almost invisible, while it's not necessarily true that we value improvements for large crates more than improvements for small crates.
If you plot a bunch of points on a vertical log scale, then take the arithmetic average of their y pixel positions, you'll get the position on that same scale of their geometric mean. Equivalently, `geometric_mean(xs) == exp(arithmetic_mean(ln(xs)))`.
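To make that equivalence concrete, here is a small Rust sketch (the function names are just for illustration, not from the perf codebase): taking the arithmetic mean of the logarithms and exponentiating gives exactly the product-based geometric mean.

```rust
fn geometric_mean(xs: &[f64]) -> f64 {
    // n-th root of the product of the values
    let n = xs.len() as f64;
    xs.iter().product::<f64>().powf(1.0 / n)
}

fn mean_of_logs(xs: &[f64]) -> f64 {
    // arithmetic mean taken on a log scale, then mapped back
    let n = xs.len() as f64;
    (xs.iter().map(|x| x.ln()).sum::<f64>() / n).exp()
}

fn main() {
    let times = [0.5, 1.2, 3.4, 27.8]; // hypothetical build times in seconds
    println!("{:.4}", geometric_mean(&times)); // both print the same value
    println!("{:.4}", mean_of_logs(&times));
}
```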
Indeed, I was looking at the actual numbers today. We have 11 benchmarks used for the dashboard. Here are the "clean" "check" times for them:
The total time is 49.7s.
I think it's important that we provide some way to get an overview of the absolute times for these workflows, so that we can gauge the typical user experience in each case (and set targets accordingly).
Should this be closed? There is a dashboard now.
I think it's non-obvious that the dashboard meets our goals. In particular, inclusion of style-servo means that the numbers are practically not changing for most releases, because the average is heavily skewed by that crate. (One could argue that is reasonable if the performance there has not improved, though). I think there's more to invest in here.
This should implement rust-lang/rust#48750, but for starters just containing the FROM-SCRATCH and SMALL-CHANGE usage scenarios.
I imagine this to be a separate page on perf.rlo, showing just two graphs, one for each usage scenario.
Each graph should contain a data point for each stable release (starting a few releases back, maybe with 1.17), the current beta, and the current nightly. The score should be calculated as the geometric mean of all build times for the given usage scenario, as listed in rust-lang/rust#48750.
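As a rough sketch of what that calculation could look like (the data, names, and types here are hypothetical, not the actual rustc-perf API), the score for each release would be the geometric mean of its build times for the scenario:

```rust
use std::collections::BTreeMap;

/// Geometric mean of the build times (in seconds) for one usage scenario.
fn score(times: &[f64]) -> f64 {
    let n = times.len() as f64;
    (times.iter().map(|t| t.ln()).sum::<f64>() / n).exp()
}

fn main() {
    // Hypothetical data: release -> build times for the FROM-SCRATCH scenario.
    let mut from_scratch = BTreeMap::new();
    from_scratch.insert("1.17.0", vec![12.3, 48.1, 95.7]);
    from_scratch.insert("1.24.0", vec![10.9, 41.5, 80.2]);
    from_scratch.insert("beta", vec![10.1, 39.8, 77.4]);

    // One data point per release, to be plotted on the dashboard graph.
    for (release, times) in &from_scratch {
        println!("{}: score {:.2}", release, score(times));
    }
}
```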
The main tasks to accomplish this are:

- `cargo test --lib --no-run`
- `cargo test --no-run`
- `cargo test --lib --no-run`
- `cargo test --no-run`
- `cargo test --test=all --no-run`
- `-Cincremental` (on versions that don't support it yet)

That page should make sure that all benchmark results for computing a given aggregate result are available, otherwise the result would be misleadingly low.
The new benchmarks should also be displayed on the regular perf.rlo page. Some open questions:
- How to display the `cargo test` benchmarks? Probably just treat them as separate benchmarks, that is, they would add another row instead of another column: `encoding-rs` and `encoding-rs (test)`, for example, would not be tied together in any way (although it would be nice if they could keep sharing the same source directory).

@Mark-Simulacrum, @nikomatsakis, thoughts?