
Create initial version of a "dashboard" that shows long-term performance trends #192

Open · 3 of 10 tasks
michaelwoerister opened this issue Mar 15, 2018 · 16 comments
Labels: A-ui (Issues dealing with the perf.rlo site UI), C-feature-request (A feature request)
@michaelwoerister (Member) commented Mar 15, 2018

This should implement rust-lang/rust#48750, but for starters it would cover just the FROM-SCRATCH and SMALL-CHANGE usage scenarios.

I imagine this to be a separate page on perf.rlo, showing just two graphs, one for each usage scenario.

Each graph should contain a data point for each stable release (starting a few releases back, maybe with 1.17), the current beta, and the current nightly. The score should be calculated as the geometric mean of all build times for the given usage scenario, as listed in rust-lang/rust#48750.
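
For concreteness, here is a minimal sketch of how such a score could be computed (the function and its inputs are hypothetical, not existing rustc-perf code); it also refuses to produce a score when a benchmark result is missing, so a partial set of results can't yield a misleadingly low aggregate:

```rust
/// Hypothetical sketch, not actual rustc-perf code: the dashboard score for
/// one usage scenario, computed as the geometric mean of the benchmark build
/// times (in seconds). Returns `None` if any benchmark has no result, so the
/// aggregate is never computed from a partial, misleadingly low set.
fn scenario_score(build_times: &[Option<f64>]) -> Option<f64> {
    if build_times.is_empty() {
        return None;
    }
    let mut log_sum = 0.0;
    for time in build_times {
        log_sum += (*time)?.ln();
    }
    Some((log_sum / build_times.len() as f64).exp())
}
```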

The main tasks to accomplish this are:

  • Add the missing benchmarks
    • ripgrep
    • webrender
    • cargo
    • winapi
    • stm32f103xx
    • encoding-rs (cargo test --lib --no-run)
    • clap-rs (cargo test --no-run)
    • regex (cargo test --lib --no-run)
    • syn (cargo test --no-run)
    • futures (cargo test --test=all --no-run)
  • Allow for running benchmarks in a Rust-version-specific way (so we don't crash when trying to pass -Cincremental to versions that don't support it yet); see the sketch after this list
  • Set up the new page (e.g. under perf.rlo/dashboard.html)
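
A possible shape for the version-specific handling mentioned above (a sketch only; the helper, the flag value, and the 1.24 cutoff are assumptions for illustration, not existing rustc-perf code):

```rust
/// Hypothetical sketch: only ask for incremental compilation on toolchains
/// that understand -Cincremental, so older releases are benchmarked without
/// it instead of erroring out. The (1, 24) cutoff is an assumption used
/// purely for illustration.
fn incremental_flag(version: (u32, u32)) -> Option<&'static str> {
    if version >= (1, 24) {
        Some("-Cincremental=incremental-cache-dir")
    } else {
        None
    }
}

fn main() {
    let mut flags: Vec<String> = Vec::new();
    if let Some(incr) = incremental_flag((1, 17)) {
        flags.push(incr.to_string());
    }
    // For 1.17 this leaves `flags` empty, so the old compiler is never
    // handed a flag it doesn't support.
    println!("{flags:?}");
}
```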

That page should make sure that all benchmark results needed to compute a given aggregate are available; otherwise the aggregate would be misleadingly low.

The new benchmarks should also be displayed on the regular perf.rlo page. Some open questions:

  • Do we need to start looking into running tests in parallel on multiple machines (because the number of benchmarks keeps growing)?
  • What to do with the cargo test benchmarks? Probably just treat them as separate benchmarks; that is, they would add another row instead of another column. encoding-rs and encoding-rs (test), for example, would not be tied together in any way (although it would be nice if they could keep sharing the same source directory).

@Mark-Simulacrum, @nikomatsakis, thoughts?

@nikomatsakis

This sounds great!

@Mark-Simulacrum (Member)

Do we need to start looking into running tests in parallel on multiple machines (because the number of benchmarks keeps growing)?

I think it's time to at least consider this -- we're currently at about 1h40m, which is quite long, though still within the limits needed to keep up with CI. However, since we are planning on adding more benchmarks, we should look into either faster hardware (currently an Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz with 8 virtual threads) or distributing the build over more machines. Distributing the build is probably the most feasible long-term option, though we'd have to ensure that the same benchmarks are always run on the same dedicated hardware.

I'm not sure what resources we have for dedicated hardware (since we're measuring timing information we probably want to avoid the cloud).

@michaelwoerister (Member, Author)

I agree that running the benchmarks in parallel is the better option. That CPU is already pretty fast. I doubt that you can get something that is more than 50% faster in single-threaded scenarios (which we still have quite a few of).

If money were not an issue, I'd probably rent a couple of dedicated servers somewhere. That would be easiest to administrate, I guess. I'll ask around a bit.

Would implementing multi-machine execution be a lot of effort?

@Mark-Simulacrum (Member)

Would implementing multi-machine execution be a lot of effort?

Somewhat, but we'll need to do most of the work anyway for proper try build scheduling from the frontend.

@aturon (Member) commented May 23, 2018

I wrote a script to gather data on several of the current benchmarks across a bunch of releases. Mock-up dashboard is here. I'll be updating this soon to include the "add println!" case.

@nnethercote (Contributor)

I see https://perf.rust-lang.org/dashboard.html is now up, thanks to #238. I think having a per-version compile time tracker is a very good thing. But I also have some major concerns with the current implementation, mostly around the fact that it's very unclear to me what is being measured.

Y-axis:

  • Apparently it's a log scale.
    • Linear would be better.
  • It doesn't start at zero.
    • It should.

Terms:

  • What is "latency"? Is it compile time?
    • Could we just use "compile time"?
  • Use of the term "workflow" is surprising; that term isn't used in the "graphs" or "compare" views.
    • Within the rustc-perf code, "build kind" is used to refer to this check/debug/opt concept.
  • What is the "build" workflow -- debug or opt?
    • Can we have three graphs, and call them check/debug/opt? (I know that "debug" isn't explicitly used in the other views, because it's the default, but I think it should be explicit.)
  • What is "worst-case" and "best-case"? Do they map onto existing terms used in the "graphs" and "compare" views, like "base incremental" and "clean incremental"?
    • If so, can we use the existing terms?

Measurements

  • What is actually being measured? aturon told me on IRC it's an average. An average of what measurements? What kind of average -- arithmetic, geometric, harmonic?
    • There's an argument that we should use geometric mean and normalize the results from the earliest-measured version as 1, and then report all subsequent measurements relative to that. (I can expand if people want to hear more.)
  • Are the same benchmarks being averaged for all the different versions? What if some versions fail to build one or more benchmarks?

@aturon (Member) commented May 30, 2018

@nnethercote

Y-axis:

  • Apparently it's a log scale.
  • Linear would be better.

Can you say why? I believe that @scottmcm proposed log scale to make it easier to observe large percentage changes even when the absolute measure is low. I definitely found the graph more informative this way than on linear scale.

  • It doesn't start at zero.
  • It should.

Why?

There's an argument that we should use geometric mean and normalize the results from the earliest-measured version as 1, and then report all subsequent measurements relative to that.

Personally, I'm very interested in getting a dashboard view of not just relative improvements, but also of the actual user experience as conveyed by absolute times. (We've also discussed some possible targets in terms of absolute times for various workflows.) That said, there are multiple useful views into the data oriented around absolute times; on the spreadsheet version, I found it interesting to break things down by percentile, to get a sense for "typical" vs "outlier" experience, but the set of benchmarks is small enough that this probably isn't very meaningful.

Are the same benchmarks being averaged for all the different versions? What if some versions fail to build one or more benchmarks?

Yes, as I mentioned on IRC, the subset of benchmarks used is the set that succeeds on all versions.

@nnethercote (Contributor)

Log scales are visually misleading. In this case, compile time increases are visually diminished and compile time decreases are visually exaggerated. Log scales are a reasonable choice when your values cover multiple orders of magnitude, but currently the difference between the minimum and maximum values is less than 5x, which isn't much at all. (And the graphs could easily be made taller.)

Starting the y-axis above zero is also visually misleading. E.g. in the top graph a reduction in time from 2s to 1s would look like an enormous relative improvement.

Averages can be dangerous, especially arithmetic averages, especially if the measurement sets aren't consistent. E.g. if script-servo fails to compile with a particular version the results will suddenly be much better because script-servo is something close to 50% of the time for the entire benchmark suite. (To give you a sense of the variation, a "clean debug" build of the smallest benchmark, helloworld, takes 0.3B instructions. The same build of script-servo takes 830B instructions. Most of the other benchmarks are 5--40B instructions.) The geometric mean of normalized results is much more robust against these kinds of changes.
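
A quick numeric illustration of that robustness point, using made-up but roughly representative instruction counts:

```rust
// If script-servo (830B instructions) drops out of the suite, a plain
// arithmetic mean of absolute costs collapses even though no benchmark
// got faster. A geometric mean of per-benchmark ratios against a baseline
// (here: no change, so every ratio is 1.0) is unaffected.
fn main() {
    let full = [0.3_f64, 5.0, 10.0, 20.0, 40.0, 830.0];
    let without_servo = &full[..5];

    let mean = |xs: &[f64]| xs.iter().sum::<f64>() / xs.len() as f64;
    let geomean = |xs: &[f64]| {
        (xs.iter().map(|x| x.ln()).sum::<f64>() / xs.len() as f64).exp()
    };

    // Arithmetic mean of raw costs: ~150.9 vs ~15.1 -- a phantom 10x "win".
    println!("{:.1} vs {:.1}", mean(&full), mean(without_servo));

    // Ratios against a baseline where nothing changed: all 1.0.
    let ratios_full = [1.0_f64; 6];
    let ratios_without = &ratios_full[..5];

    // Geometric mean of ratios: 1.0 either way -- no phantom improvement.
    println!("{:.1} vs {:.1}", geomean(&ratios_full), geomean(ratios_without));
}
```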

@scottmcm (Member) commented Jun 5, 2018

Note that the geometric mean is just the arithmetic mean on a log scale, so the geometric mean being the right choice is a strong indicator that a log scale is also the right choice.

Also, the just-noticeable difference for a quantity is generally relative, not absolute, so a log scale is the best one to answer "has this improved (or regressed) enough that people would notice?" Seow (2008) says that this holds for distinguishing between two durations, and suggests a threshold of 20% -- which is easiest to see on a log scale, where it's ±0.2 (if using natural log).
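
(As a quick check: ln(1.2) ≈ 0.18 and ln(0.8) ≈ −0.22, so a ±20% change shows up as roughly ±0.2 on a natural-log axis.)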

@nnethercote (Contributor)

Note that the geometric mean is just the arithmetic mean on a log scale

I know what geometric mean is, and arithmetic mean, and a log scale, but I don't understand that sentence at all. Can you explain more or give a link?

suggests a threshold of 20%

I am interested in changes as small as 1%. Such changes are easy to spot if the graph is tall enough and the data points are close to each other, especially when the points are labelled with the actual numeric values.

To repeat my arguments against a log scale:

  • A log scale makes sense when the data covers multiple orders of magnitude, which is not the case here.
  • A log scale makes sense when the data is somehow exponential, such as an exponential growth rate, which is not the case here.
  • A linear scale is simpler, more typical, more obvious and intuitive, and non-distorting.

When I first saw the y-axis I honestly thought there was a bug in the graphing code. After some staring, I wondered if it might be log scale. (Every other log scale I've ever seen has been obvious, with markers like 1, 10, 100, 1000, etc.) I had to ask on IRC to confirm. It wasn't a good experience.

@michaelwoerister (Member, Author)

In my view this dashboard is meant to show a general long term trend of compiler performance. The number it shows is an abstract "score" and not meant for daily use when optimizing the compiler, similar to SpecInt or Octane. Both of these use the geometric mean for combining their various sub-benchmarks into one number (see rust-lang/rust#48750 (comment)). @scottmcm also provided some research reference here: rust-lang/rust#48750 (comment).

One potential problem with the arithmetic mean is that benchmarks for big crates (like style-servo or script-servo) make improvements in smaller crates almost invisible, while it's not necessarily true that we value improvements for large crates more than improvements for small crates.

@scottmcm (Member) commented Jun 7, 2018

I know what geometric mean is, and arithmetic mean, and a log scale, but I don't understand that sentence at all. Can you explain more or give a link?

If you plot a bunch of points on a vertical log scale, then take the arithmetic average of their y pixel positions, you'll get the position on that same scale of their geometric mean.

Equivalently, geomean(it) = mean(it.map(f64::ln)).exp().
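
Or as a tiny self-contained demo (not code from this repository):

```rust
// Demo of the identity: the geometric mean computed directly equals
// exp(arithmetic mean of the logs). Both values print as 4 (up to rounding).
fn main() {
    let xs = [2.0_f64, 8.0, 4.0];
    let n = xs.len() as f64;

    let direct = xs.iter().product::<f64>().powf(1.0 / n);
    let via_logs = (xs.iter().map(|x| x.ln()).sum::<f64>() / n).exp();

    println!("{direct} {via_logs}");
}
```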

@nnethercote (Contributor)

One potential problem with the arithmetic mean is that benchmarks for big crates (like style-servo or script-servo) make improvements in smaller crates almost invisible, while it's not necessarily true that we value improvements for large crates more than improvements for small crates.

Indeed, I was looking at the actual numbers today. We have 11 benchmarks used for the dashboard. Here are the "clean" "check" times for them:

7.3s, 0.5s, 0.8s, 1.9s, 2.1s, 2.0s, 1.1s, 2.7s, 30.1s, 0.8s, 0.4s.

The total time is 49.7s. style-servo, at 30.1s, accounts for over 60% of the runtime. It's too much. I think we should normalize times so that each benchmark has equal weight. (And then take the geometric mean of the normalized times.) That does get us away from absolute times, which is a shame, but I think evening out the imbalance is more important.
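
A sketch of that weighting scheme (the helper and the choice of baseline are hypothetical, not existing rustc-perf code): divide each benchmark's time by its time under some baseline release, so every crate contributes equally, then take the geometric mean of the ratios.

```rust
/// Hypothetical sketch: equal-weight aggregate. Each benchmark's time is
/// normalized against a baseline release, and the ratios are combined with
/// a geometric mean. A result of 0.9 means "10% faster than baseline",
/// whether the crate takes 0.4s or 30s in absolute terms.
fn normalized_score(times: &[f64], baseline_times: &[f64]) -> f64 {
    assert_eq!(times.len(), baseline_times.len());
    let log_sum: f64 = times
        .iter()
        .zip(baseline_times)
        .map(|(t, base)| (t / base).ln())
        .sum();
    (log_sum / times.len() as f64).exp()
}
```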

@aturon (Member) commented Jul 2, 2018

I think it's important that we provide some way to get an overview of the absolute times for these workflows, so that we can gauge the typical user experience in each case (and set targets accordingly).

@jonmorton

Should this be closed? There is a dashboard now.

@Mark-Simulacrum (Member)

I think it's non-obvious that the dashboard meets our goals. In particular, inclusion of style-servo means that the numbers are practically not changing for most releases, because the average is heavily skewed by that crate. (One could argue that is reasonable if the performance there has not improved, though). I think there's more to invest in here.

@rylev added the A-ui (Issues dealing with the perf.rlo site UI) and C-feature-request (A feature request) labels on Jul 29, 2021