Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sometimes approves the wrong commit #174

Open
ehuss opened this issue Jul 1, 2022 · 9 comments
Open

Sometimes approves the wrong commit #174

ehuss opened this issue Jul 1, 2022 · 9 comments

Comments

@ehuss
Copy link

ehuss commented Jul 1, 2022

There have been two circumstances where a normal @bors r+ approved an old commit:

My hunch is that somehow bors missed the push notification, and the state.head_sha doesn't get updated to the latest version.

I think Homu should never approve a commit that is not the latest commit. I imagine it should check what the latest commit is when approving instead of assuming that the database is in sync. There's definitely risks about race conditions here that may not be solvable, but some extra effort might make it more resilient.

What's scary is that this may be happening and nobody notices. These two instances only got noticed because the old commit failed in CI.

@pinkforest
Copy link

pinkforest commented Jul 1, 2022

Related Zulip:
https://rust-lang.zulipchat.com/#narrow/stream/242791-t-infra/topic/Bors.20races.20.2F.20out.20of.20sync/near/288119101

So homu is the result of GitHub REST API rate limitation if I understood it correctly where as bors-ng is stateless and homu stateful relying entirely on the ingress webhooks for state synchronisation - which seems to have failed few or more times.

I am unsure if there is some type of retry mechanism in the ingress GH webhook calls and whether homu honors retries or whether this is something that could be configured on GH side or make Homu support - which could provide easy low hanging fruit fix to increase stability - webhooks are supposed to get status back until they are delivered
EDIT: Yes there is new API for exactly that ^^^ 🥳

1 - Figure out the impact first

I am going to see if I can do a GraphQL query/ies to scrape off all the PR r+ comments/commits and correlate to figure out how many times - and exactly where - homu has picked up the wrong commit vs what the latest commit was given PR

That analysis would / will allow us to see which and how many PRs are affected and then see how much of an issue this is.

2 - Potentially fix something

Then to potentially fix this in Homu it would mean either - depending on reliability / real impact factors as above

  1. Using GraphQL queries to ensure the state is synchronsised that doesn't hurt rate limits - if impact was high
  • or alternatively low hanging fruit if impact was low -
  1. Figure out why webhook calls are being lost and maybe put some more reliability tlc around those

@pinkforest
Copy link

pinkforest commented Jul 1, 2022

Re: GitHub delivery tracking API

Seems GitHub has introduced API exactly for what we might need re 2.2) ..

https://github.blog/changelog/2021-06-30-webhook-deliveries-api/

https://docs.github.com/en/rest/reference/repos#webhooks

Can someone go see what the situation looks for 30 last days?

Also it may as well be that simple "healthcheck" polling monitor that can trigger the retries via this new API in homu might greatly improve webhook deliverability if required ?

And if the "healthcheck" monitor simply reports to Homu that there are undelivered webhooks it would hang on before acting on things..

@Mark-Simulacrum
Copy link
Member

I think rather than focusing on approval time, we should probably have a check at (bors) merge commit generation time that the approved source commit is equivalent to the PR head commit at that time. That also helps us avoid issues with further pushes being missed.

I don't think focusing on missed webhooks is going to work well; GitHub has just not delivered things in the past for some time, we shouldn't depend on that for reliability here.

@pinkforest
Copy link

pinkforest commented Jul 1, 2022

That would require querying GH one per commit generation for those but I guess there isn't too many of those that would break the rate limit as was the reason for Homu to exist
Also that would be just to provide a failure and feedback mechanism but not really fixing the sync issue that happens at approval time
Are we happy with that?
Or do we mean that homu will pick nearest-before-approval commit regardless what was supposed approved commit (out of sync) without providing feedback about some internal sync failure e.g. just deal with it?

@Mark-Simulacrum
Copy link
Member

We would post an error message in that case. There's no real issue with rate limits -- we're only generating new merge commits probably roughly <15 times/day, so checking at that time is very cheap.

I think there's more work that can be done to improve homu's synchronization with real state, but that's a separate issue from this one and doesn't need to be coupled.

@pinkforest
Copy link

pinkforest commented Jul 1, 2022

I'll also try to finish the impact analysis during the weekend off curiosity -

just needs some regular expression on bors messages and comparing them to commit history

I just got some quick GraphQL running on GH API for doing analysis what has been going on:
https://github.com/pinkforest/rust-gh-homu-off-syncs

---- PR(true): Add a `--build-dir` flag to rustbuild
 79f8dc0 - commit
bors - :pushpin: Commit 79f8dc0b898b0a387df684a539cd97446a0f964f has been approved by `jyn514`

<!-- @bors r=jyn514 79f8dc0b898b0a387df684a539cd97446a0f964f -->
<!-- homu: {"type":"Approved","sha":"79f8dc0b898b0a387df684a539cd97446a0f964f","approver":"jyn514"} -->
---- PR(true): Add macro_rules! rustdoc change to 1.62 relnotes
 4ea18cc - commit
bors - :v: @ can now approve this pull request
<!-- homu: {"type":"Delegated","delegator":"jyn514","delegate":""} -->
bors - :v: @CAD97 can now approve this pull request
<!-- homu: {"type":"Delegated","delegator":"jyn514","delegate":"CAD97"} -->
bors - :pushpin: Commit ab437ad5e46db30f1dc08d21cd73f9ef9ffa13b5 has been approved by `Mark-Simulacrum`

<!-- @bors r=Mark-Simulacrum ab437ad5e46db30f1dc08d21cd73f9ef9ffa13b5 -->
<!-- homu: {"type":"Approved","sha":"ab437ad5e46db30f1dc08d21cd73f9ef9ffa13b5","approver":"Mark-Simulacrum"} -->
bors - :pushpin: Commit 4ea18ccf7e4e4afe213c2b3a74c558135c423fde has been approved by `jyn514`

<!-- @bors r=jyn514 4ea18ccf7e4e4afe213c2b3a74c558135c423fde -->
<!-- homu: {"type":"Approved","sha":"4ea18ccf7e4e4afe213c2b3a74c558135c423fde","approver":"jyn514"} -->
---- PR(true): Rollup of 9 pull requests
 1d845bd - commit
 c6f362a - commit
 ddb6313 - commit
 debee1e - commit
 8931fbd - commit
 e043821 - commit
 c29e584 - commit
 cd7bd8b - commit
 3fcf84a - commit
 d791310 - commit
 79f8dc0 - commit
 4ea18cc - commit
 0d5636c - commit
 41e7991 - commit
 0420231 - commit
 e59693a - commit
 734f21c - commit
 80dd48b - commit
 c4acd06 - commit
 335e7d3 - commit
 18d4228 - commit
bors - :pushpin: Commit 18d4228456a98fd6d8950f74fd117aba7fb45757 has been approved by `matthiaskrgr`

<!-- @bors r=matthiaskrgr 18d4228456a98fd6d8950f74fd117aba7fb45757 -->
<!-- homu: {"type":"Approved","sha":"18d4228456a98fd6d8950f74fd117aba7fb45757","approver":"matthiaskrgr"} -->
bors - :hourglass: Testing commit 18d4228456a98fd6d8950f74fd117aba7fb45757 with merge 7e2733bb1dd9afe5fd20370ca4d539d42ac50419...
<!-- homu: {"type":"BuildStarted","head_sha":"18d4228456a98fd6d8950f74fd117aba7fb45757","merge_sha":"7e2733bb1dd9afe5fd20370ca4d539d42ac50419"} -->
bors - :sunny: Test successful - [checks-actions](https://github.com/rust-lang-ci/rust/runs/7148849106?check_suite_focus=true)
Approved by: matthiaskrgr
Pushing 7e2733bb1dd9afe5fd20370ca4d539d42ac50419 to master...
<!-- homu: {"type":"BuildCompleted","approved_by":"matthiaskrgr","base_ref":"master","builders":{"checks-actions":"https://github.com/rust-lang-ci/rust/runs/7148849106?check_suite_focus=true"},"merge_sha":"7e2733bb1dd9afe5fd20370ca4d539d42ac50419"} -->

After that I can push a PR for Homu to do that merge commit commit sync check

@notriddle
Copy link
Contributor

if I understood it correctly where as bors-ng is stateless and homu stateful

Unfortunately, no. There are three borses, not two.

bors bot added a commit to bors-ng/bors-ng that referenced this issue Jul 10, 2022
1520: fix(batcher): verify the PR commit before merging r=notriddle a=notriddle

Fixes #1519

CC rust-lang/homu#174 (comment)

Co-authored-by: Michael Howell <michael@notriddle.com>
@pinkforest
Copy link

pinkforest commented Jul 17, 2022

Raised #178 for suspected general WebHook delivery instability which addressing commit mixups alone would not fix (this ticket)

@tgross35
Copy link

tgross35 commented Mar 1, 2024

For reference, here is another case where this (or something similar) happened rust-lang/rust#119748 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants