
Track PR metrics with Bencher #3849

Merged: 9 commits merged into diesel-rs:master on Feb 23, 2024

Conversation

@epompeii (Contributor)

This PR is a follow-up from a discussion with @weiznich on adding the nightly master metrics to Bencher: #3848

A separate Testbed has been created for each of the three targets in order to distinguish their results (i.e. ubuntu-latest-postgres, ubuntu-latest-sqlite, and ubuntu-latest-mysql).
Currently only the sqlite metrics are being ingested by Bencher, due to build failures in the other two. Please let me know if you would like me to try to fix those in a follow-up PR.

The results can be seen here: https://bencher.dev/perf/diesel
Note that it takes a hot second for all 99 results to be returned.
Those results are from a test run on epompeii/diesel: https://github.com/epompeii/diesel/actions/runs/6826611755/job/18566903071#step:10:3477

In order for this to work on diesel-rs/diesel, a maintainer will need to add the BENCHER_API_TOKEN as a repository secret. I could either send over the key I set up, or add one of the maintainers of diesel to the project on Bencher so you can use your own API token (preferred solution).
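
For reference, here is a rough sketch of how that secret and the per-backend Testbeds could be wired into a workflow. This is only illustrative: the job layout, feature flags, and cargo bench invocation are assumptions, not the exact workflow in this PR; only the project slug, the secret name, and the testbed names come from the setup described above.

```yaml
# Illustrative sketch only; the real workflow in this PR may differ.
name: Benchmarks

on:
  push:
    branches: [master]

jobs:
  benchmark:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        backend: [sqlite, postgres, mysql]   # one Bencher Testbed per backend
    steps:
      - uses: actions/checkout@v4
      - uses: bencherdev/bencher@main        # installs the bencher CLI
      - name: Track benchmarks with Bencher
        run: |
          bencher run \
            --project diesel \
            --token "${{ secrets.BENCHER_API_TOKEN }}" \
            --testbed "ubuntu-latest-${{ matrix.backend }}" \
            --adapter rust_criterion \
            "cargo bench --no-default-features --features ${{ matrix.backend }}"
```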

@weiznich (Member)

Thanks for opening this PR. For now I would like to use the bench workflow (the one used to check for regressions in PRs) as the experimentation workflow, as it contains far fewer benchmarks and might benefit more from getting comments on PRs and similar improvements.

As a more general note for Bencher itself: it would be great if the frontend could somehow group benchmarks that belong to the same group. We likely do not want to see all benchmarks for all sizes at the same time. Either we would be interested in seeing all insert benchmarks of a certain size for the different crates, or we would be interested in seeing all sizes for one specific crate.

@epompeii (Contributor, Author)

For now I would like to use the bench workflow (the one used to check for regressions in PRs)

Oh, my misunderstanding. I'm sorry! I'll make the change to the bench workflow.

It would be great if the frontend could group benchmarks from the same group somewhat.

I totally agree! My current thinking is split between two ways of doing this:

  1. Create a way to tag particular dimensions (branch, testbed, benchmark). Then you could filter benchmarks based on a particular tag; say, to view all insert benchmarks you would tag each insert benchmark with operation: insert, etc. (Tracking issue: Create custom tags for all dimensions bencherdev/bencher#240)
  2. Add a way to search each dimension (branch, testbed, benchmark) in the perf plot view and then allow you to save/star particular plots (Add a built-in way to save plots bencherdev/bencher#231) and even view them in a dashboard-like view (Add a built-in way to pin plots bencherdev/bencher#232). So you could have an insert benchmarks plot, a trivial_query plot, etc. Currently you have to do this with browser bookmarks and multiple windows.

Either way, I think I will end up doing 2, so the question then becomes: would it be worth it to also have 1?
I'm very much open to suggestions here if you have any 😃

@epompeii force-pushed the try_to_fix_metrics_job branch 3 times, most recently from cadacec to a03da24 on November 14, 2023
@epompeii changed the title from "Track master metrics with Bencher" to "Track PR metrics with Bencher" on Nov 14, 2023
@epompeii force-pushed the try_to_fix_metrics_job branch 2 times, most recently from 16150d3 to 9dce159 on November 15, 2023
@epompeii (Contributor, Author) commented on Nov 15, 2023

Okay, I have moved things over to the bench workflow!
You can see an example PR where this branch is serving as the target branch: epompeii#1

There are three comments on the PR, one for each testbed: ubuntu-latest-sqlite, ubuntu-latest-mysql, and ubuntu-latest-postgres.

I have an option to only start commenting once an Alert is generated (using the --ci-only-on-alert flag). I didn't enable it so that you could see the output, but let me know if you want that in the final version.

Currently the "tag to run benchmarks" mechanism is in place, but it is not meant as a security feature. Once the tag is added, any subsequent revisions with that tag will also be benchmarked. I can set up GitHub Actions to require a reviewer before each run if you would prefer that. However, the way I have things set up now should be secure against pwn requests.
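
For context, the "tag to run benchmarks" gate is essentially a label check on the benchmark job. A minimal sketch of that condition (the run-benchmarks label name matches the label later used on this PR; the surrounding workflow is illustrative):

```yaml
# Sketch of gating a benchmark job on a PR label; not the exact workflow.
on:
  pull_request:
    types: [opened, synchronize, labeled]

jobs:
  benchmark_pr:
    # Only run when the PR carries the run-benchmarks label
    if: contains(github.event.pull_request.labels.*.name, 'run-benchmarks')
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: cargo bench   # wrapped with bencher run as sketched earlier
```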

@weiznich (Member)

I just want to leave a comment here that this PR is not forgotten. I'm away for a few days; I'll try to have a closer look when I'm back.

@weiznich (Member) commented on Dec 1, 2023

Thanks for updating the PR to compare results on PRs. That looks helpful.

There are a few things I would like to clarify before moving forward:

  • Currently the summaries posted to the PR do not contain any indication of how much the performance changed. It would be really helpful to have this information in the PR comments.
  • If I click on any of the benchmark result links I'm asked for a login. All the information there needs to be public, as we cannot expect contributors to have an account there.

@epompeii (Contributor, Author) commented on Dec 3, 2023

Thank you for the great feedback!
The example PR has been updated with the requested changes: epompeii#1

Currently the summaries posted to the PR do not contain any indication of how much the performance changed. It would be really helpful to have this information in the PR comments.

Great point! I've now added two sets of data points (a quick worked example with made-up numbers follows the lists below):
Metric Results:

  • The numeric metric result (in this case latency measured in nanoseconds)
  • The percent performance change from the metric result vs the baseline (Δ%)

Metric Boundary:

  • The numeric boundary limit used to determine whether an alert should be generated (in this case an upper boundary)
  • How close the metric result is to the boundary limit (%)
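
As a made-up example of how those two sets of numbers read together: if the baseline latency is 100 ns, the new result is 105 ns, and the upper boundary limit is 110 ns, then the comment would show a metric of 105 ns with Δ = +5.00%, and a boundary limit of 110 ns with the result at roughly 95% of that limit (105 / 110, assuming the percentage is simply result over limit), so no alert would be generated.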

If I click on any of the benchmark result links I'm asked for a login. All the information there needs to be public, as we cannot expect contributors to have an account there.

Very true! I have updated things so that for all public projects the PR comment only uses public links.
If an alert is generated, then the link from both the PR comment and the public perf plot itself now goes to a public alert page. This is an example from me dogfooding with Bencher: https://bencher.dev/perf/bencher/alerts/67dcb5fb-f164-4252-81f9-d9cae2159d15

I've also split out the benchmarking of the master branch and the PR branches a bit more. This will keep things more statistically sound as the master branch will only be benchmarked once per new push as opposed to on every PR run. This has the added benefit of halving the PR benchmarking times.

@epompeii force-pushed the try_to_fix_metrics_job branch 3 times, most recently from 9e6c6ff to 03c5e3d on December 6, 2023
@epompeii (Contributor, Author) commented on Dec 7, 2023

I've been doing a lot of thinking lately about the security implications of running benchmarks from fork PRs.

Was the original intention behind the "run on tag" semantics a security feature (requiring review) or simply a CI-speed optimization (only run benchmarks on PRs where it would matter)?

If "run on tag" was meant as a security feature, I think we should move over to creating a couple of GitHub Environments and adding Required Reviewers for fork PRs. I have a write-up on this option here: https://bencher.dev/docs/how-to/github-actions/#benchmark-fork-pr-from-target-branch-with-required-reviewers
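
For reference, a minimal sketch of what that first option could look like, assuming an environment named benchmark-fork-pr (the name is hypothetical) is configured with Required Reviewers in the repository settings:

```yaml
# Sketch only: gate fork-PR benchmark runs behind a reviewed environment.
on:
  pull_request_target:
    types: [opened, synchronize]

jobs:
  benchmark_fork_pr:
    runs-on: ubuntu-latest
    # Required Reviewers on this environment must approve before the job runs
    environment: benchmark-fork-pr
    steps:
      - uses: actions/checkout@v4
        with:
          # Check out the fork's head; only acceptable because a reviewer
          # has to approve the run first
          ref: ${{ github.event.pull_request.head.sha }}
      - run: cargo bench
```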

If the "run on tag" was meant more as a speed of CI optimization, I can work that into this much safer option: https://bencher.dev/docs/how-to/github-actions/#benchmark-fork-pr-and-upload-from-default-branch

Please let me know what you think, and I will update this PR accordingly!

@weiznich (Member) commented on Dec 8, 2023

Thanks for the updates. It's good to see what it would look like.

You mention that this approach now reuses "old" builds for the master branch as an optimization. That might become problematic, as it seems like there is really a lot of noise between benchmark jobs run on different action runners. The old approach ran both the current PR and the master branch one after the other on the same machine. That ensured all benchmarks had a similar environment. In my opinion it reduces the amount of noise drastically, which is what makes this helpful for judging the performance impact of certain changes. I'm not sure whether this behavior is still there with the new approach; maybe you can clarify that?

Was the original intention behind the "run on tag" semantics a security feature (requiring review) or simply a CI-speed optimization (only run benchmarks on PRs where it would matter)?

The original reasoning behind the tag-based solution is mostly that it is a speed optimization, as everything is "local" to the PR. The worst thing that could happen there was that someone submitted a PR that wasted CI time. No results would have been stored for a longer time, so it couldn't influence anything else.

@epompeii (Contributor, Author) commented on Dec 8, 2023

You mention that this approach now reuses "old" builds for the master branch as an optimization. That might become problematic, as it seems like there is really a lot of noise between benchmark jobs run on different action runners. The old approach ran both the current PR and the master branch one after the other on the same machine. That ensured all benchmarks had a similar environment. In my opinion it reduces the amount of noise drastically, which is what makes this helpful for judging the performance impact of certain changes. I'm not sure whether this behavior is still there with the new approach; maybe you can clarify that?

I do need to clarify 😃
The main motivation behind splitting out the benchmarking of the master branch and the PR branches is to keep things more statistically sound (not as an optimization). That is, Bencher keeps track of the benchmark results for a particular branch over time. Running the master branch every time there is a PR would lead to irregular oversampling of the master branch.

For example, under the old system, if we had 3 PRs against master at HEAD#1 and then later only 1 PR at HEAD#2, we would have 4 samples for HEAD#1 and only 2 samples for HEAD#2 (1 initial branch push run + each PR run). This means we would be statistically overweighting HEAD#1 in our calculations.

The statistical thresholds used by Bencher are meant to help deal with this noise, and we only want a single sample run against any branch per change. So yes, the ability to judge the performance impact of certain changes is still there in the new approach! It relies on the statistics, powered by the data from previous runs.

The issue with relying on only a single run per version is that you can't do such statistical analysis, as N=1. Even within a single run, there can be very high volatility between when one benchmark suite is run and another is run. I've found the statistical approach to be more reliable than single-shot relative benchmarking using wall clock time.

This has the added benefit of halving the PR benchmarking times.

This is very much a positive externality/benefit and not at all the motivation behind the change. I'm sorry for any confusion!

The original reasoning behind the tag-based solution is mostly that it is a speed optimization, as everything is "local" to the PR. The worst thing that could happen there was that someone submitted a PR that wasted CI time. No results would have been stored for a longer time, so it couldn't influence anything else.

Okay, that makes sense. That has me leaning towards adding "run on tag" to this much safer option: https://bencher.dev/docs/how-to/github-actions/#benchmark-fork-pr-and-upload-from-default-branch
This would mean you only have to tag a PR once as needing to be benchmarked, and every subsequent run will automatically be safely benchmarked without needing any additional approval.

@epompeii (Contributor, Author) commented on Dec 14, 2023

I have updated this PR to now use the workflow_run + "run on tag" setup.
In order for workflow_run to actually run, it needs to exist on the default branch for the repository.

Here is a new example PR doing just that: epompeii#3

Let me know what you think!

@weiznich (Member)

Thanks for the update.

The issue with relying on only a single run per version is that you can't do such statistical analysis, as N=1. Even within a single run, there can be very high volatility between when one benchmark suite is run and another is run. I've found the statistical approach to be more reliable than single-shot relative benchmarking using wall clock time.

Well, that would be true if we did only a single run, but we use criterion there, which on its own does quite a lot to ensure that we get statistically relevant results by essentially running each benchmark many times. I think for some of our benchmarks criterion performs up to several million individual runs.

For example, under the old system, if we had 3 PRs against master at HEAD#1 and then later only 1 PR at HEAD#2, we would have 4 samples for HEAD#1 and only 2 samples for HEAD#2 (1 initial branch push run + each PR run). This means we would be statistically overweighting HEAD#1 in our calculations.

My main issue with this approach is that the old benchmark on HEAD#1 might have run on a completely different runner with a different setup, etc. The metrics data indicate that there is quite a lot of variation between these different runs. I totally understand that you can do some statistics there to get a threshold that says anything above/below this is a change in runtime, but my fear is that this threshold would be really large. That would render such a system much less useful for me.
My reasoning with the current approach is essentially: both benchmarks run under very similar conditions (as they run after each other). That removes one potential source of noise and allows us to drastically reduce the threshold for considering something a "change". The statistics for that are provided by criterion.

@epompeii (Contributor, Author) commented on Dec 15, 2023

My reasoning with the current approach is essentially: both benchmarks run under very similar conditions (as they run after each other). That removes one potential source of noise and allows us to drastically reduce the threshold for considering something a "change". The statistics for that are provided by criterion.

Sounds good! I can move things over to using relative benchmarking then: https://bencher.dev/docs/how-to/track-benchmarks/#relative-benchmarking

I will only run the benchmark harness itself once, as criterion obviously has that covered 😃

@epompeii (Contributor, Author)

Okay, I've moved things back to using relative benchmarking.
The PR branch and the diesel master branch are benchmarked one after the other, within the same job.

For security reasons, the results are uploaded as artifacts. Then in a separate workflow_run job they are stored in Bencher and the PR comment is generated.
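
For anyone following along, the shape of that split is roughly the following. This is a condensed, assumption-laden sketch loosely mirroring the run_benches.yml / track_benches.yml files reviewed below; the workflow names, intermediate file names, and most step details are illustrative rather than the exact contents of this PR.

```yaml
# run_benches.yml (sketch): runs in the unprivileged pull_request context,
# benchmarks master and the PR branch back to back, and uploads the results.
on:
  pull_request:
    types: [opened, synchronize, labeled]

jobs:
  benchmark:
    if: contains(github.event.pull_request.labels.*.name, 'run-benchmarks')
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0          # needed to check out both branches
      - run: git checkout master && cargo bench > master.txt
      - run: git checkout "${{ github.event.pull_request.head.sha }}" && cargo bench > pr.txt
      - uses: actions/upload-artifact@v4
        with:
          name: bench-results
          path: |
            master.txt
            pr.txt

---
# track_benches.yml (sketch): runs from the default branch with access to
# secrets, stores the results in Bencher, and posts the PR comment.
on:
  workflow_run:
    workflows: [Benchmarks]       # hypothetical name of the workflow above
    types: [completed]

jobs:
  track:
    if: github.event.workflow_run.conclusion == 'success'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/download-artifact@v4
        with:
          name: bench-results
          run-id: ${{ github.event.workflow_run.id }}
          github-token: ${{ secrets.GITHUB_TOKEN }}
      - uses: bencherdev/bencher@main
      - run: |
          # Mapping the results back to the PR number/branch needs extra
          # flags; they are omitted in this sketch.
          bencher run \
            --project diesel \
            --token "${{ secrets.BENCHER_API_TOKEN }}" \
            --adapter rust_criterion \
            --github-actions "${{ secrets.GITHUB_TOKEN }}" \
            --file pr.txt
```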

You can see the results in this example pull request: epompeii#3

@weiznich (Member)

Thanks for the update. I won't have the capacity to review this PR before my Christmas holidays; I will try to do this in the new year.

@weiznich (Member) left a review

Thanks for continuing to work on this. It looks nearly good now. I left some minor comments about the GitHub Actions definition. The results + comments look OK to me, although I personally would prefer if the table posted in the PR comment contained the old (main branch) and new (PR branch) timings explicitly, plus the absolute difference in a third column, but that shouldn't block merging this PR.

Review comments were left on .github/workflows/run_benches.yml and .github/workflows/track_benches.yml (now resolved).
@epompeii (Contributor, Author) commented on Feb 3, 2024

Thanks for all the great feedback! 😃
Some reviewers prefer to be the one to hit resolve on their comments, so I have held off doing so.
With that said though, I think the only thing left is getting you added to the project!

@weiznich (Member) commented on Feb 9, 2024

(I resolved the merge conflict. CI builds will likely fail due to recent rustc API changes; I will take care of that later.)

@epompeii (Contributor, Author)

@weiznich I just wanted to check in. What can I do to help get this PR across the finish line?

@weiznich (Member)

There is not much that can currently be done, as this is blocked on passing CI, which in turn is blocked on fixing rust-lang/rust#120830, which essentially just needs some more time for the PR with the fix to be merged there.

@weiznich added the run-benchmarks label (used to indicate that GitHub Actions should run our benchmark suite) on Feb 22, 2024
@weiznich added this pull request to the merge queue on Feb 22, 2024
The github-merge-queue bot removed this pull request from the merge queue due to no response for status checks on Feb 22, 2024
@weiznich added this pull request to the merge queue on Feb 23, 2024
The github-merge-queue bot removed this pull request from the merge queue due to no response for status checks on Feb 23, 2024
@weiznich added this pull request to the merge queue on Feb 23, 2024
Merged via the queue into diesel-rs:master with commit ad8ef47 on Feb 23, 2024
51 checks passed
@epompeii added a commit to bencherdev/bencher that referenced this pull request on Feb 23, 2024
@epompeii deleted the try_to_fix_metrics_job branch on February 23, 2024