Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ci: fix flakiness in update-check in Manage integration check workflow #9787

Merged
merged 1 commit into from
Aug 2, 2024

Conversation

frazarshad
Copy link
Contributor

@frazarshad frazarshad commented Jul 26, 2024

refs: #8937

Description

Addresses the issue with flake in the manage-integration-check.yml workflow. This is due to a race condition where the workflow is executed 3 times in quick succession and is expected to run in order of execution.

The "Integration tests" workflow triggers manage-integration-check.yml 3 times. once for requested, in_progress, and completed statuses of the calling workflow.
In cases where the completed status might happen at the same time as the other two (such as when the "Integration tests" workflow is skipped), the 3 calls of the manage-integration-check.yml workflow might run simultaneously and cause a race condition.

the race condition occurs because the final call of manage-integration-check.yml (the one triggered by the completed status of "Integration tests") expects a github check run to already exist. since this might not be the case in a race condition, it fails

To fix this two things have been done:

  • instead of crashing, the job now creates a new github check run with a completed status
  • since the previous two jobs will run after the final job, they might accidentally overwrite the github check run with the completed status and accidentally create a check run with in_progess status. their filter check has been removed entirely so that they consider completed check runs as well.

Security Considerations

Scaling Considerations

Documentation Considerations

Testing Considerations

Upgrade Considerations

Copy link

codecov bot commented Jul 26, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 56.41%. Comparing base (bf791ed) to head (c768823).
Report is 65 commits behind head on master.

Additional details and impacted files
Components Coverage Δ
SwingSet/kernel 73.42% <ø> (+0.06%) ⬆️
ERTP 92.45% <ø> (ø)
Orchestration 94.78% <ø> (+0.34%) ⬆️
swing-store 95.95% <ø> (ø)

see 18 files with indirect coverage changes

Copy link

cloudflare-workers-and-pages bot commented Jul 26, 2024

Deploying agoric-sdk with  Cloudflare Pages  Cloudflare Pages

Latest commit: f6b26a3
Status: ✅  Deploy successful!
Preview URL: https://6d015d44.agoric-sdk.pages.dev
Branch Preview URL: https://ci-update-check-issue.agoric-sdk.pages.dev

View logs

@frazarshad frazarshad force-pushed the ci/update-check-issue branch from ab634f2 to a7de897 Compare July 26, 2024 09:37
@frazarshad frazarshad changed the title ci:issue ci: fix flakiness in update-check in Manage integration check workflow Jul 26, 2024
@frazarshad frazarshad marked this pull request as ready for review July 26, 2024 09:51
@frazarshad frazarshad requested a review from mhofman July 26, 2024 09:52
@frazarshad frazarshad force-pushed the ci/update-check-issue branch from a7de897 to c2a351e Compare July 26, 2024 11:15
@mhofman
Copy link
Member

mhofman commented Aug 1, 2024

  • their filter check has been removed entirely so that they consider completed check runs as well.

I need to think through this. In particular, what happens if a check had previously been created from a previous run (e.g. a failed run like a flake), and the integration workflow is re-run. In that case I believe we need to create a new check instead of re-using the previous one.

Is there a mergify-experiment PR on which we can test these cases first?

@frazarshad
Copy link
Contributor Author

I need to think through this. In particular, what happens if a check had previously been created from a previous run (e.g. a failed run like a flake), and the integration workflow is re-run. In that case I believe we need to create a new check instead of re-using the previous one.

Is there a mergify-experiment PR on which we can test these cases first?

@mhofman As you suggested that issue is present. i resolved it in a new commit, by adding the run attempt number to the external id and filtering on that.

I tested this out in this repo https://github.com/frazarshad/agoric-sdk where i have added a 60s sleep to the create check job. forcing the update check job to run first (simulating the situation described in the PR description).

I have also tested out repeating the Integation test workflow multiple times, (increasing the run attempt count.)

as you can see in these two runs (which are triggered by the 3rd re-run of Integration test workflow), even on the 3rd run attempt, we filter out any old checks created and create a new check.
Manage Integration check (completed) (runs first)
Manage Integration check (in-progress)

@mhofman
Copy link
Member

mhofman commented Aug 1, 2024

This is pretty awesome!

This makes me think of another scenario: what happens if the initial attempt takes longer to create the result than for the second attempt? Or for that matter if we have multiple runs but the first one executes this workflow after the second one. I think we would end up with the synthetic result of the first one taking precedence over the second one. Arguably this is an unrelated issue, so out of scope for this PR.

Copy link
Member

@mhofman mhofman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm with one small tweak

external_id,
output: {
title: "Integration Test Aggregate Result",
summary: `Synthetic check capturing the result of the <a href="${context.payload.workflow_run.html_url}">integration-test workflow run</a>`,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're now explicitly taking into consideration the attempt's number, let's include that detail in the summary, at least if it's not the 1st attempt. Here and in the normal creation place.

@frazarshad frazarshad added the automerge:rebase Automatically rebase updates, then merge label Aug 2, 2024
@frazarshad frazarshad force-pushed the ci/update-check-issue branch from c768823 to f6b26a3 Compare August 2, 2024 07:12
@mergify mergify bot merged commit d12673b into master Aug 2, 2024
79 checks passed
@mergify mergify bot deleted the ci/update-check-issue branch August 2, 2024 08:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
automerge:rebase Automatically rebase updates, then merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants