Add automatic retry to GHA nightly build workflows #3652

ankatiyar · 2024-02-26T11:24:55Z

Description

The nightly build and notification workflow creates a lot of spurious failure-notification issues, which are usually resolved by re-running the failed tests.
Proposal: Add an automatic re-run of the failed tests (or of the entire test suite) before the workflow reaches the step that creates the failure-notification issue, so that only genuine failures create issues.

Possible Implementation

I looked into this very briefly and saw these actions, but haven't tried it out yet.

ElenaKhaustova · 2024-03-19T12:37:19Z

Checked the solutions possible in our case:

1. https://github.com/marketplace/actions/retry-step - will not work in our case: it re-runs a shell command that you specify, so it cannot wrap steps that rely on uses directives. It is most applicable when you are re-running explicit shell commands, not workflows.
2. https://www.thisdot.co/blog/how-to-retry-failed-steps-in-github-action-workflows - this solution seems the easiest in our case and can be applied at the level of individual checks, so that we re-run only the checks that fail rather than all-checks. The drawback is that you cannot easily configure re-run parameters (number of attempts, etc.).
3. https://github.com/Wandalen/wretry.action - does seem to do the right thing, but it's a bit harder to apply than option 2.

Based on the above, we decided to try two of the solutions: 2 and 3.
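
For comparison, options 1 and 3 differ mainly in what they can wrap. The fragment below is a rough sketch based on the inputs documented in each action's README; the version tags, attempt counts, delays and timeouts are illustrative assumptions, not settings that were tried in this thread.

```yaml
steps:
  # Option 1 (the "Retry Step" action, published as nick-fields/retry):
  # it can only re-run the shell command given in `command`.
  - name: Run linter with retries
    uses: nick-fields/retry@v3
    with:
      timeout_minutes: 10   # assumed value
      max_attempts: 3       # assumed value
      command: make lint

  # Option 3 (Wandalen/wretry.action): wraps another action named in `action`
  # and passes that action's inputs through the multi-line `with` string.
  - name: Set up Python with retries
    uses: Wandalen/wretry.action@master
    with:
      action: actions/setup-python@v5
      with: |
        python-version: ${{ inputs.python-version }}
      attempt_limit: 3      # assumed value
      attempt_delay: 2000   # milliseconds, assumed value
```

Neither fragment retries a whole reusable workflow; both operate on a single step.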

ElenaKhaustova · 2024-03-26T15:44:47Z

After checking 2 and 3 from the above, we've figured out that:

both 2 and 3 allow retrying steps or actions;
reusable workflows must be called as jobs: https://github.com/orgs/community/discussions/27362
potentially, we can add a retry at one of three levels: test-nightly-build.yml (workflow level), all-checks.yml (workflow level) or the checks themselves, i.e. unit-tests.yml, e2e-tests.yml, ... (steps level)
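
To illustrate the second point, this is roughly what calling a reusable workflow as a job looks like. The file names and input names are borrowed from the snippets in this thread; the input types and the placeholder steps are assumptions.

```yaml
# Caller, e.g. all-checks.yml: `uses` sits on the job, not on a step.
jobs:
  lint:
    uses: ./.github/workflows/lint.yml
    with:
      os: ubuntu-latest
      python-version: "3.11"
      branch: ${{ inputs.branch }}
---
# Called workflow, e.g. lint.yml: it must declare the workflow_call trigger.
on:
  workflow_call:
    inputs:
      os:
        type: string
      python-version:
        type: string
      branch:
        type: string
jobs:
  lint:
    runs-on: ${{ inputs.os }}
    steps:
      - run: make lint   # placeholder for the real lint steps
```

Because the call occupies a whole job, step-level retry helpers cannot wrap it directly.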

To make it work at the workflow level, we must also create actions from the workflows. So the plan is to start from the checks and apply retry at the steps level.
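
A minimal sketch of a steps-level retry in the style of option 2, assuming a single extra attempt is enough. The step id `lint` is a name chosen for illustration.

```yaml
steps:
  - name: Run linter
    id: lint
    # Let the job continue even if this attempt fails
    continue-on-error: true
    run: make lint

  - name: Run linter (retry)
    # `outcome` is the step result before continue-on-error is applied
    if: steps.lint.outcome == 'failure'
    run: make lint
```

The number of attempts is fixed by how many retry steps are written out, which is the configurability drawback of this approach.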

ElenaKhaustova · 2024-03-27T17:36:19Z

The current working solution tracks the job status at the level of the checks (lint.yml, unit-tests.yml, ...) and adds an extra retry job that runs if the main job fails; see the example below for the lint job.

jobs:
  lint:
    runs-on: ${{ inputs.os }}
    # Assumption: the original snippet did not show how the status read by
    # lint-retry is set. Here it is exposed as a job output taken from the
    # final step's outcome (its result before continue-on-error is applied).
    outputs:
      status: ${{ steps.run-lint.outcome }}
    steps:
      # continue-on-error on every step keeps the job itself from failing
      - name: Checkout code
        continue-on-error: true
        uses: actions/checkout@v4
        with:
          ref: ${{ inputs.branch }}
      - name: Set up Python ${{ inputs.python-version }}
        continue-on-error: true
        uses: actions/setup-python@v5
        with:
          python-version: ${{ inputs.python-version }}
      - name: Cache python packages
        continue-on-error: true
        uses: actions/cache@v4
        with:
          path: ~/.cache/pip
          key: ${{inputs.os}}-python-${{inputs.python-version}}
      - name: Install dependencies
        continue-on-error: true
        run: |
          make install-test-requirements
          make install-pre-commit
      - name: pip freeze
        continue-on-error: true
        run: pip freeze
      - name: Run linter
        id: run-lint
        continue-on-error: true
        run: make lint
  lint-retry:
    runs-on: ${{ inputs.os }}
    if: ${{ always() && needs.lint.outputs.status != 'success' }}
    needs: lint
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          ref: ${{ inputs.branch }}
      - name: Set up Python ${{ inputs.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ inputs.python-version }}
      - name: Cache python packages
        uses: actions/cache@v4
        with:
          path: ~/.cache/pip
          key: ${{inputs.os}}-python-${{inputs.python-version}}
      - name: Install dependencies
        run: |
          make install-test-requirements
          make install-pre-commit
      - name: pip freeze
        run: pip freeze
      - name: Run linter
        run: make lint

To make it work, we add continue-on-error: true to each step so that execution doesn't stop. We cannot add continue-on-error: true at the job level, because then the job status is failure and the gatekeeper fails as well.
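
As a small illustration of how step-level continue-on-error changes what later steps see, the sketch below prints both result fields of a step. The step id `lint` is hypothetical; `outcome` and `conclusion` are the fields of the steps context described in the GitHub Actions documentation.

```yaml
steps:
  - name: Run linter
    id: lint
    continue-on-error: true
    run: make lint

  - name: Report linter result
    run: |
      # outcome: result before continue-on-error is applied
      echo "outcome:    ${{ steps.lint.outcome }}"
      # conclusion: result after continue-on-error is applied
      echo "conclusion: ${{ steps.lint.conclusion }}"
```

If `make lint` fails, the first field reports failure while the second reports success, which is why the job keeps running and why a retry has to key off `outcome`.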

The drawback of the solution is that we copy-paste all the steps which makes it harder to maintain.

We cannot do the same at the level of all-checks.yml or nightly-build.yml because GHA doesn't support running multiple workflows within one job, or using needs and uses (needed to get the status from the first job and to run the retry) at the same time. So the solution below will not work.

  lint:
    strategy:
      matrix:
        os: [ ubuntu-latest ]
        python-version: [ "3.11" ]
    uses: ./.github/workflows/lint-retry.yml
    if: ${{ always() && needs.lint.outputs.status != 'success' }}
    needs: lint
    with:
      os: ${{ matrix.os }}
      python-version: ${{ matrix.python-version }}
      branch: ${{ inputs.branch }}

An alternative solution is to create a custom action from the workflow and retry that action. This looks quite complex, and there is no confidence that it will work with composite actions, so we might need to create a Docker container action.
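
For reference, a composite action built from the lint steps might look roughly like the sketch below. The file path, action name and descriptions are hypothetical; this was not tried in this thread, and whether a retry wrapper can drive such an action is exactly the open question mentioned above.

```yaml
# Hypothetical location: .github/actions/lint/action.yml
name: Lint
description: Install the test requirements and run the linter
inputs:
  python-version:
    description: Python version to set up
    required: true
runs:
  using: composite
  steps:
    - uses: actions/setup-python@v5
      with:
        python-version: ${{ inputs.python-version }}
    # run steps in a composite action must declare a shell explicitly
    - run: |
        make install-test-requirements
        make install-pre-commit
      shell: bash
    - run: make lint
      shell: bash
```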

Sources used:

Explanation on why we cannot call reusable workflows as steps: https://github.com/orgs/community/discussions/27362
Explanation on how continue-on-error works depending on where it’s placed: actions/toolkit#1034 ("Wrong behaviour when combining 'continue-on-error' and 'failure()' in subsequent steps")
Solution for steps: https://www.thisdot.co/blog/how-to-retry-failed-steps-in-github-action-workflows
Make merge gatekeeper skip some checks: https://github.com/upsidr/merge-gatekeeper/blob/main/docs/action-usage.md
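
For the last source, a hedged sketch of how the gatekeeper could be told to skip certain checks, based on the `ignored` input described in that document. The job names listed are hypothetical and would have to match the real check names.

```yaml
jobs:
  merge-gatekeeper:
    runs-on: ubuntu-latest
    permissions:
      checks: read
      statuses: read
    steps:
      - name: Run Merge Gatekeeper
        uses: upsidr/merge-gatekeeper@v1
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
          # Comma-separated job names the gatekeeper should not wait on
          ignored: lint-retry,unit-tests-retry
```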

ElenaKhaustova · 2024-03-27T18:01:15Z

From the above and PS with @ankatiyar, it was decided NOT to proceed with any of the described solutions as they seem to bring more difficulties than value.

@merelcht, @SajidAlamQB, what do you think? Please let me know if I'm missing anything or if there is any other possible solution in your mind!

merelcht · 2024-04-02T08:54:13Z

@ElenaKhaustova thanks for investigating this in so much detail and explaining all possibilities. This is definitely a lot more complex than I thought. I agree it's not worth having such a complex retry system at this point in time, because jobs aren't failing that frequently because of flakiness. We can always revisit this if we find that our builds aren't stable enough anymore and we need to retry too often.

ElenaKhaustova · 2024-04-02T13:41:09Z

Closing the issue: after research and several discussions, it was decided not to proceed with it for now.

ankatiyar added the Component: DevOps Issue/PR that addresses automation, CI, GitHub setup label Feb 26, 2024

github-actions bot mentioned this issue Mar 1, 2024

Monthly issue metrics report #3671

Open

merelcht added this to the DevOps and DevSetup spring 2024 cleanup milestone Mar 11, 2024

merelcht assigned ElenaKhaustova Mar 18, 2024

ElenaKhaustova closed this as completed Apr 2, 2024

ElenaKhaustova reopened this Apr 2, 2024

ElenaKhaustova closed this as completed Apr 2, 2024
