
Create initial version of a "dashboard" that shows long-term performance trends #192

Open · 3 of 10 tasks
michaelwoerister opened this issue Mar 15, 2018 · 16 comments
Labels: A-ui (Issues dealing with the perf.rlo site UI), C-feature-request (A feature request)
@michaelwoerister (Member) commented Mar 15, 2018

This should implement rust-lang/rust#48750, but for starters it would cover just the FROM-SCRATCH and SMALL-CHANGE usage scenarios.

I imagine this to be a separate page on perf.rlo, showing just two graphs, one for each usage scenario.

Each graph should contain a data point for each stable release (starting a few releases back, maybe with 1.17), the current beta, and the current nightly. The score should be calculated as the geometric mean of all build times for the given usage scenario, as listed in rust-lang/rust#48750.
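
For concreteness, here is a minimal sketch of how such a score could be computed (the function and its inputs are hypothetical, not existing rustc-perf code); it also refuses to produce a score when a benchmark result is missing, so a partial set of results can't yield a misleadingly low aggregate:

```rust
/// Hypothetical sketch, not actual rustc-perf code: the dashboard score for
/// one usage scenario, computed as the geometric mean of the benchmark build
/// times (in seconds). Returns `None` if any benchmark has no result, so the
/// aggregate is never computed from a partial, misleadingly low set.
fn scenario_score(build_times: &[Option<f64>]) -> Option<f64> {
    if build_times.is_empty() {
        return None;
    }
    let mut log_sum = 0.0;
    for time in build_times {
        log_sum += (*time)?.ln();
    }
    Some((log_sum / build_times.len() as f64).exp())
}
```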

The main tasks to accomplish this are:

  • Add the missing benchmarks
    • ripgrep
    • webrender
    • cargo
    • winapi
    • stm32f103xx
    • encoding-rs (cargo test --lib --no-run)
    • clap-rs (cargo test --no-run)
    • regex (cargo test --lib --no-run)
    • syn (cargo test --no-run)
    • futures (cargo test --test=all --no-run)
  • Allow for running benchmarks in a Rust-version-specific way (so we don't crash when trying to pass -Cincremental to versions that don't support it yet); see the sketch after this list
  • Set up the new page (e.g. under perf.rlo/dashboard.html)
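
A possible shape for the version-specific handling mentioned above (a sketch only; the helper, the flag value, and the 1.24 cutoff are assumptions for illustration, not existing rustc-perf code):

```rust
/// Hypothetical sketch: only ask for incremental compilation on toolchains
/// that understand -Cincremental, so older releases are benchmarked without
/// it instead of erroring out. The (1, 24) cutoff is an assumption used
/// purely for illustration.
fn incremental_flag(version: (u32, u32)) -> Option<&'static str> {
    if version >= (1, 24) {
        Some("-Cincremental=incremental-cache-dir")
    } else {
        None
    }
}

fn main() {
    let mut flags: Vec<String> = Vec::new();
    if let Some(incr) = incremental_flag((1, 17)) {
        flags.push(incr.to_string());
    }
    // For 1.17 this leaves `flags` empty, so the old compiler is never
    // handed a flag it doesn't support.
    println!("{flags:?}");
}
```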

That page should make sure that all benchmark results needed to compute a given aggregate are available; otherwise the aggregate would be misleadingly low.

The new benchmarks should also be displayed on the regular perf.rlo page. Some open questions:

  • Do we need to start looking into running tests in parallel on multiple machines (because the number of benchmarks keeps growing)?
  • What to do with the cargo test benchmarks? Probably just treat them as separate benchmarks; that is, they would add another row instead of another column. encoding-rs and encoding-rs (test), for example, would not be tied together in any way (although it would be nice if they could keep sharing the same source directory).

@Mark-Simulacrum, @nikomatsakis, thoughts?

@nikomatsakis

This sounds great!

@Mark-Simulacrum (Member)

Do we need to start looking into running tests in parallel on multiple machines (because the number of benchmarks keeps growing)?

I think it's time to at least consider this -- we're currently at about 1h40m, which is quite long, though still within the limits needed to keep up with CI. However, since we are planning on adding more benchmarks, we should look into either faster hardware (currently an Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz with 8 virtual threads) or distributing the build over more machines. Distributing the build is probably the most feasible long-term option, though we'd have to ensure that the same benchmarks are always run on the same dedicated hardware.

I'm not sure what resources we have for dedicated hardware (since we're measuring timing information we probably want to avoid the cloud).

@michaelwoerister (Member, Author)

I agree that running the benchmarks in parallel is the better option. That CPU is already pretty fast. I doubt that you can get something that is more than 50% faster in single-threaded scenarios (which we still have quite a few of).

If money were not an issue, I'd probably rent a couple of dedicated servers somewhere. That would be easiest to administrate, I guess. I'll ask around a bit.

Would implementing multi-machine execution be a lot of effort?

@Mark-Simulacrum (Member)

Would implementing multi-machine execution be a lot of effort?

Somewhat, but we'll need to do most of the work anyway for proper try build scheduling from the frontend.

@aturon (Member) commented May 23, 2018

I wrote a script to gather data on several of the current benchmarks across a bunch of releases. Mock-up dashboard is here. I'll be updating this soon to include the "add println!" case.

@nnethercote (Contributor)

I see https://perf.rust-lang.org/dashboard.html is now up, thanks to #238. I think having a per-version compile time tracker is a very good thing. But I also have some major concerns with the current implementation, mostly around the fact that it's very unclear to me what is being measured.

Y-axis:

  • Apparently it's a log scale.
    • Linear would be better.
  • It doesn't start at zero.
    • It should.

Terms:

  • What is "latency"? Is it compile time?
    • Could we just use "compile time"?
  • Use of the term "workflow" is surprising; that term isn't used in the "graphs" or "compare" views.
    • Within the rustc-perf code, "build kind" is used to refer to this check/debug/opt concept.
  • What is the "build" workflow -- debug or opt?
    • Can we have three graphs, and call them check/debug/opt? (I know that "debug" isn't explicitly used in the other views, because it's the default, but I think it should be explicit.)
  • What is "worst-case" and "best-case"? Do they map onto existing terms used in the "graphs" and "compare" views, like "base incremental" and "clean incremental"?
    • If so, can we use the existing terms?

Measurements

  • What is actually being measured? aturon told me on IRC it's an average. An average of what measurements? What kind of average -- arithmetic, geometric, harmonic?
    • There's an argument that we should use geometric mean and normalize the results from the earliest-measured version as 1, and then report all subsequent measurements relative to that. (I can expand if people want to hear more.)
  • Are the same benchmarks being averaged for all the different versions? What if some versions fail to build one or more benchmarks?

@aturon (Member) commented May 30, 2018

@nnethercote

Y-axis:

  • Apparently it's a log scale.
  • Linear would be better.

Can you say why? I believe that @scottmcm proposed log scale to make it easier to observe large percentage changes even when the absolute measure is low. I definitely found the graph more informative this way than on linear scale.

  • It doesn't start at zero.
  • It should.

Why?

There's an argument that we should use geometric mean and normalize the results from the earliest-measured version as 1, and then report all subsequent measurements relative to that.

Personally, I'm very interested in getting a dashboard view of not just relative improvements, but also of the actual user experience as conveyed by absolute times. (We've also discussed some possible targets in terms of absolute times for various workflows.) That said, there are multiple useful views into the data oriented around absolute times; on the spreadsheet version, I found it interesting to break things down by percentile, to get a sense for "typical" vs "outlier" experience, but the set of benchmarks is small enough that this probably isn't very meaningful.

Are the same benchmarks being averaged for all the different versions? What if some versions fail to build one or more benchmarks?

Yes, as I mentioned on IRC, the subset of benchmarks used is the set that succeeds on all versions.

@nnethercote (Contributor)

Log scales are visually misleading. In this case, compile time increases are visually diminished and compile time decreases are visually exaggerated. Log scales are a reasonable choice when your values cover multiple orders of magnitude, but currently the difference between the minimum and maximum values is less than 5x, which isn't much at all. (And the graphs could easily be made taller.)

Starting the y-axis above zero is also visually misleading. E.g. in the top graph a reduction in time from 2s to 1s would look like an enormous relative improvement.

Averages can be dangerous, especially arithmetic averages, especially if the measurement sets aren't consistent. E.g. if script-servo fails to compile with a particular version the results will suddenly be much better because script-servo is something close to 50% of the time for the entire benchmark suite. (To give you a sense of the variation, a "clean debug" build of the smallest benchmark, helloworld, takes 0.3B instructions. The same build of script-servo takes 830B instructions. Most of the other benchmarks are 5--40B instructions.) The geometric mean of normalized results is much more robust against these kinds of changes.
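
A quick numeric illustration of that robustness point, using made-up but roughly representative instruction counts:

```rust
// If script-servo (830B instructions) drops out of the suite, a plain
// arithmetic mean of absolute costs collapses even though no benchmark
// got faster. A geometric mean of per-benchmark ratios against a baseline
// (here: no change, so every ratio is 1.0) is unaffected.
fn main() {
    let full = [0.3_f64, 5.0, 10.0, 20.0, 40.0, 830.0];
    let without_servo = &full[..5];

    let mean = |xs: &[f64]| xs.iter().sum::<f64>() / xs.len() as f64;
    let geomean = |xs: &[f64]| {
        (xs.iter().map(|x| x.ln()).sum::<f64>() / xs.len() as f64).exp()
    };

    // Arithmetic mean of raw costs: ~150.9 vs ~15.1 -- a phantom 10x "win".
    println!("{:.1} vs {:.1}", mean(&full), mean(without_servo));

    // Ratios against a baseline where nothing changed: all 1.0.
    let ratios_full = [1.0_f64; 6];
    let ratios_without = &ratios_full[..5];

    // Geometric mean of ratios: 1.0 either way -- no phantom improvement.
    println!("{:.1} vs {:.1}", geomean(&ratios_full), geomean(ratios_without));
}
```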

@scottmcm (Member) commented Jun 5, 2018

Note that the geometric mean is just the arithmetic mean on a log scale, so the geometric mean being the right choice is a strong indicator that a log scale is also the right choice.

Also, the just-noticeable difference for a quantity is generally relative, not absolute, so a log scale is the best one to answer "has this improved (or regressed) enough that people would notice?" Seow (2008) says that this holds for distinguishing between two durations, and suggests a threshold of 20% -- which is easiest to see on a log scale, where it's ±0.2 (if using natural log).
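
(As a quick check: ln(1.2) ≈ 0.18 and ln(0.8) ≈ −0.22, so a ±20% change shows up as roughly ±0.2 on a natural-log axis.)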

@nnethercote (Contributor)

Note that the geometric mean is just the arithmetic mean on a log scale

I know what geometric mean is, and arithmetic mean, and a log scale, but I don't understand that sentence at all. Can you explain more or give a link?

suggests a threshold of 20%

I am interested in changes as small as 1%. Such changes are easy to spot if the graph is tall enough and the data points are close to each other, especially when the points are labelled with the actual numeric values.

To repeat my arguments against a log scale:

  • A log scale makes sense when the data covers multiple orders of magnitude, which is not the case here.
  • A log scale makes sense when the data is somehow exponential, such as an exponential growth rate, which is not the case here.
  • A linear scale is simpler, more typical, more obvious and intuitive, and non-distorting.

When I first saw the y-axis I honestly thought there was a bug in the graphing code. After some staring, I wondered if it might be log scale. (Every other log scale I've ever seen has been obvious, with markers like 1, 10, 100, 1000, etc.) I had to ask on IRC to confirm. It wasn't a good experience.

@michaelwoerister (Member, Author)

In my view this dashboard is meant to show a general long term trend of compiler performance. The number it shows is an abstract "score" and not meant for daily use when optimizing the compiler, similar to SpecInt or Octane. Both of these use the geometric mean for combining their various sub-benchmarks into one number (see rust-lang/rust#48750 (comment)). @scottmcm also provided some research reference here: rust-lang/rust#48750 (comment).

One potential problem with the arithmetic mean is that benchmarks for big crates (like style-servo or script-servo) make improvements in smaller crates almost invisible, while it's not necessarily true that we value improvements for large crates more than improvements for small crates.

@scottmcm (Member) commented Jun 7, 2018

I know what geometric mean is, and arithmetic mean, and a log scale, but I don't understand that sentence at all. Can you explain more or give a link?

If you plot a bunch of points on a vertical log scale, then take the arithmetic average of their y pixel positions, you'll get the position on that same scale of their geometric mean.

Equivalently, geomean(it) = mean(it.map(f64::ln)).exp().
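
Or as a tiny self-contained demo (not code from this repository):

```rust
// Demo of the identity: the geometric mean computed directly equals
// exp(arithmetic mean of the logs). Both values print as 4 (up to rounding).
fn main() {
    let xs = [2.0_f64, 8.0, 4.0];
    let n = xs.len() as f64;

    let direct = xs.iter().product::<f64>().powf(1.0 / n);
    let via_logs = (xs.iter().map(|x| x.ln()).sum::<f64>() / n).exp();

    println!("{direct} {via_logs}");
}
```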

@nnethercote (Contributor)

One potential problem with the arithmetic mean is that benchmarks for big crates (like style-servo or script-servo) make improvements in smaller crates almost invisible, while it's not necessarily true that we value improvements for large crates more than improvements for small crates.

Indeed, I was looking at the actual numbers today. We have 11 benchmarks used for the dashboard. Here are the "clean" "check" times for them:

7.3s, 0.5s, 0.8s, 1.9s, 2.1s, 2.0s, 1.1s, 2.7s, 30.1s, 0.8s, 0.4s.

The total time is 49.7s. style-servo, at 30.1s, accounts for over 60% of the runtime. It's too much. I think we should normalize times so that each benchmark has equal weight. (And then take the geometric mean of the normalized times.) That does get us away from absolute times, which is a shame, but I think evening out the imbalance is more important.
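
A sketch of that weighting scheme (the helper and the choice of baseline are hypothetical, not existing rustc-perf code): divide each benchmark's time by its time under some baseline release, so every crate contributes equally, then take the geometric mean of the ratios.

```rust
/// Hypothetical sketch: equal-weight aggregate. Each benchmark's time is
/// normalized against a baseline release, and the ratios are combined with
/// a geometric mean. A result of 0.9 means "10% faster than baseline",
/// whether the crate takes 0.4s or 30s in absolute terms.
fn normalized_score(times: &[f64], baseline_times: &[f64]) -> f64 {
    assert_eq!(times.len(), baseline_times.len());
    let log_sum: f64 = times
        .iter()
        .zip(baseline_times)
        .map(|(t, base)| (t / base).ln())
        .sum();
    (log_sum / times.len() as f64).exp()
}
```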

@aturon (Member) commented Jul 2, 2018

I think it's important that we provide some way to get an overview of the absolute times for these workflows, so that we can gauge the typical user experience in each case (and set targets accordingly).

@jonmorton

Should this be closed? There is a dashboard now.

@Mark-Simulacrum (Member)

I think it's non-obvious that the dashboard meets our goals. In particular, inclusion of style-servo means that the numbers are practically not changing for most releases, because the average is heavily skewed by that crate. (One could argue that is reasonable if the performance there has not improved, though). I think there's more to invest in here.

@rylev added the A-ui (Issues dealing with the perf.rlo site UI) and C-feature-request (A feature request) labels on Jul 29, 2021