
Track PR metrics with Bencher #3849

Merged: 9 commits merged into diesel-rs:master on Feb 23, 2024

Conversation

@epompeii (Contributor)

This PR is a follow-up from a discussion with @weiznich on adding the nightly master metrics to Bencher: #3848

A separate Testbed has been created for each of the three targets in order to distinguish their results (i.e. ubuntu-latest-postgres, ubuntu-latest-sqlite, and ubuntu-latest-mysql).
Currently only the sqlite metrics are being ingested by Bencher, due to build failures in the other two. Please let me know if you would like me to try to fix those in a follow-up PR.

The results can be seen here: https://bencher.dev/perf/diesel
Note that it takes a hot second for all 99 results to be returned.
Those results are from a test run on epompeii/diesel: https://github.com/epompeii/diesel/actions/runs/6826611755/job/18566903071#step:10:3477

In order for this to work on diesel-rs/diesel, a maintainer will need to add the BENCHER_API_TOKEN as a repository secret. I could either send over the key I set up, or add one of the maintainers of diesel to the project on Bencher so you can use your own API token (preferred solution).
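
For reference, here is a rough sketch of how that secret and the per-backend Testbeds could be wired into a workflow. This is only illustrative: the job layout, feature flags, and cargo bench invocation are assumptions, not the exact workflow in this PR; only the project slug, the secret name, and the testbed names come from the setup described above.

```yaml
# Illustrative sketch only; the real workflow in this PR may differ.
name: Benchmarks

on:
  push:
    branches: [master]

jobs:
  benchmark:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        backend: [sqlite, postgres, mysql]   # one Bencher Testbed per backend
    steps:
      - uses: actions/checkout@v4
      - uses: bencherdev/bencher@main        # installs the bencher CLI
      - name: Track benchmarks with Bencher
        run: |
          bencher run \
            --project diesel \
            --token "${{ secrets.BENCHER_API_TOKEN }}" \
            --testbed "ubuntu-latest-${{ matrix.backend }}" \
            --adapter rust_criterion \
            "cargo bench --no-default-features --features ${{ matrix.backend }}"
```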

@weiznich (Member)

Thanks for opening this PR. For now I would like to use the bench workflow (the one used to check for regressions in PRs) as the experimentation workflow, as it contains far fewer benchmarks and might benefit more from getting comments on PRs and similar improvements.

As a more general note for Bencher itself: it would be great if the frontend could somehow group benchmarks that belong to the same group. We likely do not want to see all benchmarks for all sizes at the same time. Either we would be interested in seeing all insert benchmarks of a certain size for the different crates, or we would be interested in seeing all sizes for one specific crate.

@epompeii (Contributor, Author)

For now I would like to use the bench workflow (the one used to check for regressions in PRs)

Oh, my misunderstanding. I'm sorry! I'll make the change to the bench workflow.

It would be great if the frontend could group benchmarks from the same group somewhat.

I totally agree! My current thinking is split between two ways of doing this:

  1. Create a way to tag particular dimensions (branch, testbed, benchmark). Then you could filter benchmarks based on a particular tag; say, to view all insert benchmarks you would tag each insert benchmark with operation: insert, etc. (Tracking issue: Create custom tags for all dimensions bencherdev/bencher#240)
  2. Add a way to search each dimension (branch, testbed, benchmark) in the perf plot view and then allow you to save/star particular plots (Add a built-in way to save plots bencherdev/bencher#231) and even view them in a dashboard-like view (Add a built-in way to pin plots bencherdev/bencher#232). So you could have an insert benchmarks plot, a trivial_query plot, etc. Currently you have to do this with browser bookmarks and multiple windows.

Either way, I think I will end up doing 2, so the question then becomes: would it be worth it to also have 1?
I'm very much open to suggestions here if you have any 😃

@epompeii force-pushed the try_to_fix_metrics_job branch 3 times, most recently from cadacec to a03da24 on November 14, 2023
@epompeii changed the title from "Track master metrics with Bencher" to "Track PR metrics with Bencher" on Nov 14, 2023
@epompeii force-pushed the try_to_fix_metrics_job branch 2 times, most recently from 16150d3 to 9dce159 on November 15, 2023
@epompeii (Contributor, Author) commented on Nov 15, 2023

Okay, I have moved things over to the bench workflow!
You can see an example PR where this branch is serving as the target branch: epompeii#1

There are three comments on the PR, one for each testbed: ubuntu-latest-sqlite, ubuntu-latest-mysql, and ubuntu-latest-postgres.

I have an option to only start commenting once an Alert is generated (using the --ci-only-on-alert flag). I didn't enable it so that you could see the output, but let me know if you want that in the final version.

Currently the "tag to run benchmarks" mechanism is in place, but it is not meant as a security feature. Once the tag is added, any subsequent revisions with that tag will also be benchmarked. I can set up GitHub Actions to require a reviewer before each run if you would prefer that. However, the way I have things set up now should be secure against pwn requests.
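
For context, the "tag to run benchmarks" gate is essentially a label check on the benchmark job. A minimal sketch of that condition (the run-benchmarks label name matches the label later used on this PR; the surrounding workflow is illustrative):

```yaml
# Sketch of gating a benchmark job on a PR label; not the exact workflow.
on:
  pull_request:
    types: [opened, synchronize, labeled]

jobs:
  benchmark_pr:
    # Only run when the PR carries the run-benchmarks label
    if: contains(github.event.pull_request.labels.*.name, 'run-benchmarks')
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: cargo bench   # wrapped with bencher run as sketched earlier
```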

@weiznich (Member)

I just want to leave a comment here that this PR is not forgotten. I'm away for a few days; I'll try to have a closer look when I'm back.

@weiznich (Member) commented on Dec 1, 2023

Thanks for updating the PR to compare results on PRs. That looks helpful.

There are a few things I would like to clarify before moving forward:

  • Currently the summaries posted to the PR do not contain any indication of how much the performance changed. It would be really helpful to have this information in the PR comments.
  • If I click on any of the benchmark result links I'm asked for a login. All the information there needs to be public, as we cannot expect contributors to have an account there.

@epompeii (Contributor, Author) commented on Dec 3, 2023

Thank you for the great feedback!
The example PR has been updated with the requested changes: epompeii#1

Currently the summaries posted to the PR do not contain any indication of how much the performance changed. It would be really helpful to have this information in the PR comments.

Great point! I've now added two sets of data points (a quick worked example with made-up numbers follows the lists below):
Metric Results:

  • The numeric metric result (in this case latency measured in nanoseconds)
  • The percent performance change from the metric result vs the baseline (Δ%)

Metric Boundary:

  • The numeric boundary limit used to determine whether an alert should be generated (in this case an upper boundary)
  • How close the metric result is to the boundary limit (%)
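
As a made-up example of how those two sets of numbers read together: if the baseline latency is 100 ns, the new result is 105 ns, and the upper boundary limit is 110 ns, then the comment would show a metric of 105 ns with Δ = +5.00%, and a boundary limit of 110 ns with the result at roughly 95% of that limit (105 / 110, assuming the percentage is simply result over limit), so no alert would be generated.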

If I click on any of the benchmark result links I'm asked for a login. All the information there needs to be public, as we cannot expect contributors to have an account there.

Very true! I have updated things so that for all public projects the PR comment only uses public links.
If an alert is generated, then the link from both the PR comment and the public perf plot itself now goes to a public alert page. This is an example from me dogfooding with Bencher: https://bencher.dev/perf/bencher/alerts/67dcb5fb-f164-4252-81f9-d9cae2159d15

I've also split out the benchmarking of the master branch and the PR branches a bit more. This will keep things more statistically sound as the master branch will only be benchmarked once per new push as opposed to on every PR run. This has the added benefit of halving the PR benchmarking times.

@epompeii force-pushed the try_to_fix_metrics_job branch 3 times, most recently from 9e6c6ff to 03c5e3d on December 6, 2023
@epompeii (Contributor, Author) commented on Dec 7, 2023

I've been doing a lot of thinking lately about the security implications of running benchmarks from fork PRs.

Was the original intention behind the "run on tag" semantics a security feature (requiring review) or simply a CI-speed optimization (only run benchmarks on PRs where it would matter)?

If "run on tag" was meant as a security feature, I think we should move over to creating a couple of GitHub Environments and adding Required Reviewers for fork PRs. I have a write-up on this option here: https://bencher.dev/docs/how-to/github-actions/#benchmark-fork-pr-from-target-branch-with-required-reviewers
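
For reference, a minimal sketch of what that first option could look like, assuming an environment named benchmark-fork-pr (the name is hypothetical) is configured with Required Reviewers in the repository settings:

```yaml
# Sketch only: gate fork-PR benchmark runs behind a reviewed environment.
on:
  pull_request_target:
    types: [opened, synchronize]

jobs:
  benchmark_fork_pr:
    runs-on: ubuntu-latest
    # Required Reviewers on this environment must approve before the job runs
    environment: benchmark-fork-pr
    steps:
      - uses: actions/checkout@v4
        with:
          # Check out the fork's head; only acceptable because a reviewer
          # has to approve the run first
          ref: ${{ github.event.pull_request.head.sha }}
      - run: cargo bench
```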

If the "run on tag" was meant more as a speed of CI optimization, I can work that into this much safer option: https://bencher.dev/docs/how-to/github-actions/#benchmark-fork-pr-and-upload-from-default-branch

Please let me know what you think, and I will update this PR accordingly!

@weiznich (Member) commented on Dec 8, 2023

Thanks for the updates. It's good to see what it would look like.

You mention that this approach now reuses "old" builds for the master branch as an optimization. That might become problematic, as it seems like there is really a lot of noise between benchmark jobs run on different action runners. The old approach ran both the current PR and the master branch one after the other on the same machine. That ensured all benchmarks had a similar environment. In my opinion it reduces the amount of noise drastically, which is what makes this helpful for judging the performance impact of certain changes. I'm not sure whether this behavior is still there with the new approach; maybe you can clarify that?

Was the original intention behind the "run on tag" semantics a security feature (requiring review) or simply a CI-speed optimization (only run benchmarks on PRs where it would matter)?

The original reasoning behind the tag-based solution is mostly that it is a speed optimization, as everything is "local" to the PR. The worst thing that could happen there was that someone submitted a PR that wasted CI time. No results would have been stored for a longer time, so it couldn't influence anything else.

@epompeii (Contributor, Author) commented on Dec 8, 2023

You mention that this approach now reuses "old" builds for the master branch as an optimization. That might become problematic, as it seems like there is really a lot of noise between benchmark jobs run on different action runners. The old approach ran both the current PR and the master branch one after the other on the same machine. That ensured all benchmarks had a similar environment. In my opinion it reduces the amount of noise drastically, which is what makes this helpful for judging the performance impact of certain changes. I'm not sure whether this behavior is still there with the new approach; maybe you can clarify that?

I do need to clarify 😃
The main motivation behind splitting out the benchmarking of the master branch and the PR branches is to keep things more statistically sound (not as an optimization). That is, Bencher keeps track of the benchmark results for a particular branch over time. Running the master branch every time there is a PR would lead to irregular oversampling of the master branch.

For example, under the old system, if we had 3 PRs against master at HEAD#1 and then later only 1 PR at HEAD#2, we would have 4 samples for HEAD#1 and only 2 samples for HEAD#2 (1 initial branch push run + each PR run). This means we would be statistically overweighting HEAD#1 in our calculations.

The statistical thresholds used by Bencher are meant to help deal with this noise, and we only want a single sample run against any branch per change. So yes, the ability to judge the performance impact of certain changes is still there in the new approach! It relies on the statistics, powered by the data from previous runs.

The issue with relying on only a single run per version is that you can't do such statistical analysis, as N=1. Even within a single run, there can be very high volatility between when one benchmark suite is run and another is run. I've found the statistical approach to be more reliable than single-shot relative benchmarking using wall clock time.

This has the added benefit of halving the PR benchmarking times.

This is very much a positive externality/benefit and not at all the motivation behind the change. I'm sorry for any confusion!

The original reasoning behind the tag-based solution is mostly that it is a speed optimization, as everything is "local" to the PR. The worst thing that could happen there was that someone submitted a PR that wasted CI time. No results would have been stored for a longer time, so it couldn't influence anything else.

Okay, that makes sense. That has me leaning towards adding "run on tag" to this much safer option: https://bencher.dev/docs/how-to/github-actions/#benchmark-fork-pr-and-upload-from-default-branch
This would mean you only have to tag a PR once as needing to be benchmarked, and every subsequent run will automatically be safely benchmarked without needing any additional approval.

@epompeii (Contributor, Author) commented on Dec 14, 2023

I have updated this PR to now use the workflow_run + "run on tag" setup.
In order for workflow_run to actually run, it needs to exist on the default branch for the repository.

Here is a new example PR doing just that: epompeii#3

Let me know what you think!

@weiznich (Member)

Thanks for the update.

The issue with relying on only a single run per version is that you can't do such statistical analysis, as N=1. Even within a single run, there can be very high volatility between when one benchmark suite is run and another is run. I've found the statistical approach to be more reliable than single-shot relative benchmarking using wall clock time.

Well, that would be true if we did only a single run, but we use criterion there, which on its own does quite a lot to ensure that we get statistically relevant results by essentially running each benchmark many times. I think for some of our benchmarks criterion performs up to several million individual runs.

For example, under the old system, if we had 3 PRs against master at HEAD#1 and then later only 1 PR at HEAD#2, we would have 4 samples for HEAD#1 and only 2 samples for HEAD#2 (1 initial branch push run + each PR run). This means we would be statistically overweighting HEAD#1 in our calculations.

My main issue with this approach is that the old benchmark on HEAD#1 might have run on a completely different runner with a different setup, etc. The metrics data indicate that there is quite a lot of variation between these different runs. I totally understand that you can do some statistics there to get a threshold that says anything above/below this is a change in runtime, but my fear is that this threshold would be really large. That would render such a system much less useful for me.
My reasoning with the current approach is essentially: both benchmarks run under very similar conditions (as they run after each other). That removes one potential source of noise and allows us to drastically reduce the threshold for considering something a "change". The statistics for that are provided by criterion.

@epompeii (Contributor, Author) commented on Dec 15, 2023

My reasoning with the current approach is essentially: both benchmarks run under very similar conditions (as they run after each other). That removes one potential source of noise and allows us to drastically reduce the threshold for considering something a "change". The statistics for that are provided by criterion.

Sounds good! I can move things over to using relative benchmarking then: https://bencher.dev/docs/how-to/track-benchmarks/#relative-benchmarking

I will only run the benchmark harness itself once, as criterion obviously has that covered 😃

@epompeii (Contributor, Author)

Okay, I've moved things back to using relative benchmarking.
The PR branch and the diesel master branch are benchmarked one after the other, within the same job.

For security reasons, the results are uploaded as artifacts. Then in a separate workflow_run job they are stored in Bencher and the PR comment is generated.
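
For anyone following along, the shape of that split is roughly the following. This is a condensed, assumption-laden sketch loosely mirroring the run_benches.yml / track_benches.yml files reviewed below; the workflow names, intermediate file names, and most step details are illustrative rather than the exact contents of this PR.

```yaml
# run_benches.yml (sketch): runs in the unprivileged pull_request context,
# benchmarks master and the PR branch back to back, and uploads the results.
on:
  pull_request:
    types: [opened, synchronize, labeled]

jobs:
  benchmark:
    if: contains(github.event.pull_request.labels.*.name, 'run-benchmarks')
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0          # needed to check out both branches
      - run: git checkout master && cargo bench > master.txt
      - run: git checkout "${{ github.event.pull_request.head.sha }}" && cargo bench > pr.txt
      - uses: actions/upload-artifact@v4
        with:
          name: bench-results
          path: |
            master.txt
            pr.txt

---
# track_benches.yml (sketch): runs from the default branch with access to
# secrets, stores the results in Bencher, and posts the PR comment.
on:
  workflow_run:
    workflows: [Benchmarks]       # hypothetical name of the workflow above
    types: [completed]

jobs:
  track:
    if: github.event.workflow_run.conclusion == 'success'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/download-artifact@v4
        with:
          name: bench-results
          run-id: ${{ github.event.workflow_run.id }}
          github-token: ${{ secrets.GITHUB_TOKEN }}
      - uses: bencherdev/bencher@main
      - run: |
          # Mapping the results back to the PR number/branch needs extra
          # flags; they are omitted in this sketch.
          bencher run \
            --project diesel \
            --token "${{ secrets.BENCHER_API_TOKEN }}" \
            --adapter rust_criterion \
            --github-actions "${{ secrets.GITHUB_TOKEN }}" \
            --file pr.txt
```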

You can see the results in this example pull request: epompeii#3

@weiznich (Member)

Thanks for the update. I won't have the capacity to review this PR before my Christmas holidays; I will try to do this in the new year.

@weiznich (Member) left a review

Thanks for continuing to work on this. It looks nearly good now. I left some minor comments about the GitHub Actions definition. The results + comments look OK to me, although I personally would prefer if the table posted in the PR comment contained the old (main branch) and new (PR branch) timings explicitly, plus the absolute difference in a third column, but that shouldn't block merging this PR.

Review comments were left on .github/workflows/run_benches.yml and .github/workflows/track_benches.yml (now resolved).
@epompeii (Contributor, Author) commented on Feb 3, 2024

Thanks for all the great feedback! 😃
Some reviewers prefer to be the one to hit resolve on their comments, so I have held off doing so.
With that said though, I think the only thing left is getting you added to the project!

@weiznich (Member) commented on Feb 9, 2024

(I resolved the merge conflict. CI builds will likely fail due to recent rustc API changes; I will take care of that later.)

@epompeii (Contributor, Author)

@weiznich I just wanted to check in. What can I do to help get this PR across the finish line?

@weiznich (Member)

There is not much that can currently be done, as this is blocked on passing CI, which in turn is blocked on fixing rust-lang/rust#120830, which essentially just needs some more time for the PR with the fix to be merged there.

@weiznich added the run-benchmarks label (used to indicate that GitHub Actions should run our benchmark suite) on Feb 22, 2024
@weiznich added this pull request to the merge queue on Feb 22, 2024
The github-merge-queue bot removed this pull request from the merge queue due to no response for status checks on Feb 22, 2024
@weiznich added this pull request to the merge queue on Feb 23, 2024
The github-merge-queue bot removed this pull request from the merge queue due to no response for status checks on Feb 23, 2024
@weiznich added this pull request to the merge queue on Feb 23, 2024
Merged via the queue into diesel-rs:master with commit ad8ef47 on Feb 23, 2024
51 checks passed
@epompeii added a commit to bencherdev/bencher that referenced this pull request on Feb 23, 2024
@epompeii deleted the try_to_fix_metrics_job branch on February 23, 2024