Add automatic retry to GHA nightly build workflows #3652

Closed
ankatiyar opened this issue Feb 26, 2024 · 6 comments
Labels: Component: DevOps (Issue/PR that addresses automation, CI, GitHub setup)

@ankatiyar
Contributor

Description

The nightly build and notification workflow creates a lot of spurious failure notification issues, which are usually resolved by re-running the failed tests.
Proposal: Add an automatic re-run of the failed tests (or the entire test suite) before the workflow reaches the step that creates the failure notification issue, so that only genuine failures create issues.

Possible Implementation

I looked into this very briefly and saw these actions, but haven't tried them out yet.

@ElenaKhaustova
Contributor

I checked the solutions that are possible in our case:

  1. https://github.com/marketplace/actions/retry-step - will not work in our case: it re-runs a command that you specify, so it cannot be applied directly to steps that rely on uses directives. It is best suited to re-running explicit shell commands, not workflows.
  2. https://www.thisdot.co/blog/how-to-retry-failed-steps-in-github-action-workflows - this solution seems the easiest in our case and can be applied at the checks level, so that we re-run only the checks that fail rather than all of all-checks. The drawback is that you cannot easily configure re-run parameters (number of attempts, etc.).
  3. https://github.com/Wandalen/wretry.action - does seem to do the right thing, but it is a bit harder to apply than the 2nd option.

Based on the above, we decided to try two of these solutions: 2 and 3.
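For illustration, a step-level retry in the spirit of option 2 might look like the sketch below. It relies only on standard GitHub Actions features (a step id, continue-on-error, and the steps context); the step id lint-first-attempt is a hypothetical name, and this is not necessarily the exact code from the linked post.

```yaml
steps:
  - name: Run linter
    id: lint-first-attempt        # hypothetical id, used only to read the outcome below
    continue-on-error: true       # do not fail the job on the first attempt
    run: make lint
  - name: Run linter (retry)
    # outcome is the step result before continue-on-error is applied
    if: steps.lint-first-attempt.outcome == 'failure'
    run: make lint
```

The retry step has no continue-on-error, so a second failure still fails the job. The number of attempts is fixed by how many such steps are written out, which matches the drawback noted for option 2.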

@ElenaKhaustova
Contributor

ElenaKhaustova commented Mar 26, 2024

After checking 2 and 3 from the above, we've figured out that:

  • both 2 and 3 allow retrying steps or actions;
  • reusable workflows must be called as jobs: https://github.com/orgs/community/discussions/27362
  • potentially, we can add a retry on one of three levels: test-nightly-build.yml (workflow level), all-checks.yml (workflow level), or the checks themselves, i.e. unit-tests.yml, e2e-tests.yml, ... (steps level)

To make it work at the workflow level, we would also have to create actions from the workflows. So the plan is to start from the checks and apply the retry at the steps level.
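As a reminder of what "called as jobs" means in practice, here is a minimal sketch of a caller workflow. The path ./.github/workflows/lint.yml and the input names are assumptions based on the file names and inputs mentioned in this thread.

```yaml
jobs:
  lint:
    # A reusable workflow is referenced with `uses` at the job level.
    # Such a job cannot also define `runs-on` or `steps`, which is why
    # step-level retry actions cannot wrap it directly.
    uses: ./.github/workflows/lint.yml
    with:
      os: ubuntu-latest
      python-version: "3.11"
      branch: ${{ inputs.branch }}
```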

@ElenaKhaustova
Contributor

ElenaKhaustova commented Mar 27, 2024

The current working solution tracks the job status at the level of the checks (lint.yml, unit-tests.yml, ...) and adds an extra retry job that runs if the main job fails; see the example below for the lint job.

jobs:
  lint:
    runs-on: ${{ inputs.os }}
    steps:
      - name: Checkout code
        continue-on-error: true
        uses: actions/checkout@v4
        with:
          ref: ${{ inputs.branch }}
      - name: Set up Python ${{ inputs.python-version }}
        continue-on-error: true
        uses: actions/setup-python@v5
        with:
          python-version: ${{ inputs.python-version }}
      - name: Cache python packages
        continue-on-error: true
        uses: actions/cache@v4
        with:
          path: ~/.cache/pip
          key: ${{inputs.os}}-python-${{inputs.python-version}}
      - name: Install dependencies
        continue-on-error: true
        run: |
          make install-test-requirements
          make install-pre-commit
      - name: pip freeze
        continue-on-error: true
        run: pip freeze
      - name: Run linter
        continue-on-error: true
        run: make lint
  lint-retry:
    runs-on: ${{ inputs.os }}
    if: ${{ always() && needs.lint.outputs.status != 'success' }}
    needs: lint
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          ref: ${{ inputs.branch }}
      - name: Set up Python ${{ inputs.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ inputs.python-version }}
      - name: Cache python packages
        uses: actions/cache@v4
        with:
          path: ~/.cache/pip
          key: ${{inputs.os}}-python-${{inputs.python-version}}
      - name: Install dependencies
        run: |
          make install-test-requirements
          make install-pre-commit
      - name: pip freeze
        run: pip freeze
      - name: Run linter
        run: make lint

To make it work, we add continue-on-error: true to each step so that execution doesn't stop. We cannot add continue-on-error: true at the job level, because then the job status is failure and the gatekeeper fails as well.

The drawback of the solution is that we copy-paste all the steps, which makes it harder to maintain.
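The snippet above reads needs.lint.outputs.status but does not show where that output is set. One way it could be wired, as a sketch under the assumption that the linter step's result is what should trigger the retry (the step id run-linter is hypothetical, and the setup steps are omitted for brevity):

```yaml
jobs:
  lint:
    runs-on: ${{ inputs.os }}
    outputs:
      # Assumption: expose the raw result of the linter step as a job output.
      # `outcome` is the result before continue-on-error is applied.
      status: ${{ steps.run-linter.outcome }}
    steps:
      # ... checkout, Python setup, and dependency install steps as above ...
      - name: Run linter
        id: run-linter              # hypothetical id
        continue-on-error: true
        run: make lint
  lint-retry:
    runs-on: ${{ inputs.os }}
    needs: lint
    if: ${{ always() && needs.lint.outputs.status != 'success' }}
    steps:
      # ... same setup steps, without continue-on-error ...
      - name: Run linter
        run: make lint
```

Without an outputs mapping like this, needs.lint.outputs.status evaluates to an empty string, so the comparison with 'success' would always be true and the retry job would run on every build.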

We cannot do the same at the level of all-checks.yml or nightly-build.yml, because GHA doesn't support running multiple workflows within one job, or using needs and uses at the same time (both are needed to get the status from the first job and to run the retry). So the solution below will not work.

  lint:
    strategy:
      matrix:
        os: [ ubuntu-latest ]
        python-version: [ "3.11" ]
    uses: ./.github/workflows/lint-retry.yml
    if: ${{ always() && needs.lint.outputs.status != 'success' }}
    needs: lint
    with:
      os: ${{ matrix.os }}
      python-version: ${{ matrix.python-version }}
      branch: ${{ inputs.branch }}

An alternative is to create a custom action from the workflow and retry that action. This looks quite complex, and there is no confidence that it will work with composite actions, so we might need to create a Docker container action.
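To give a sense of what "creating an action from the workflow" would involve, here is a rough sketch of a composite action wrapping the lint steps. The file location (for example .github/actions/lint/action.yml), the action name, and the input are all hypothetical, and the sketch says nothing about whether a retry action can actually wrap it.

```yaml
name: Lint
description: Hypothetical composite action wrapping the lint steps
inputs:
  python-version:
    description: Python version to set up
    required: true
runs:
  using: composite
  steps:
    - name: Set up Python
      uses: actions/setup-python@v5
      with:
        python-version: ${{ inputs.python-version }}
    - name: Install dependencies
      shell: bash                 # composite run steps must declare a shell
      run: |
        make install-test-requirements
        make install-pre-commit
    - name: Run linter
      shell: bash
      run: make lint
```

The calling job would still need to check out the repository first, since a local action is resolved from the checked-out workspace.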

Sources used:

  1. Explanation on why we cannot call reusable workflows as steps: https://github.com/orgs/community/discussions/27362
  2. Explanation of how continue-on-error works depending on where it's placed: "Wrong behaviour when combining 'continue-on-error' and 'failure()' in subsequent steps" (actions/toolkit#1034)
  3. Solution for steps: https://www.thisdot.co/blog/how-to-retry-failed-steps-in-github-action-workflows
  4. Make merge gatekeeper skip some checks: https://github.com/upsidr/merge-gatekeeper/blob/main/docs/action-usage.md

@ElenaKhaustova
Contributor

ElenaKhaustova commented Mar 27, 2024

Based on the above and a PS with @ankatiyar, we decided NOT to proceed with any of the described solutions, as they seem to bring more difficulties than value.

@merelcht, @SajidAlamQB, what do you think? Please let me know if I'm missing anything or if you have any other possible solution in mind!

@merelcht
Member

merelcht commented Apr 2, 2024

@ElenaKhaustova thanks for investigating this in so much detail and explaining all possibilities. This is definitely a lot more complex than I thought. I agree it's not worth having such a complex retry system at this point in time, because jobs aren't failing that frequently because of flakiness. We can always revisit this if we find that our builds aren't stable enough anymore and we need to retry too often.

@ElenaKhaustova
Contributor

Closing this issue: after research and several discussions, it was decided not to proceed with it for now.
