[testing] Reduce flaky tests by retrying git failures #9907

Closed
1 task done
Tracked by #7808
emmyoop opened this issue Apr 12, 2024 · 5 comments · Fixed by #10137, #10178 or #10179 · May be fixed by #10437
Labels: tech_debt (Behind-the-scenes changes, with little direct impact on end-user functionality); user docs [docs.getdbt.com] (Needs better documentation)

Comments

@emmyoop
Member

emmyoop commented Apr 12, 2024

Housekeeping

  • I am a maintainer of dbt-core

Short description

We have a lot of tests that are failing because of Git connection issues. Sometimes tox fails to install all dependencies and that causes the entire test run to fail without actually running any tests. This makes our monitoring noisy.

Suggested approach: leverage something like the nick-fields/retry@v3 action (example, but in the tox invocation here)
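As a rough illustration of what retrying the tox invocation means, here is a minimal Python sketch of a retry wrapper around a command. It is not how the nick-fields/retry action is implemented or configured; `run_with_retry`, `max_attempts`, `wait_seconds`, and `runner` are hypothetical names chosen for this example.

```python
import subprocess
import time


def run_with_retry(command, max_attempts=3, wait_seconds=0.0, runner=subprocess.run):
    """Run `command` until it exits with code 0 or `max_attempts` is reached.

    Returns the number of attempts used on success and raises RuntimeError
    if every attempt fails. `runner` is injectable so the loop can be
    exercised without spawning real processes (an assumption for testing).
    """
    last_code = None
    for attempt in range(1, max_attempts + 1):
        result = runner(command)
        last_code = result.returncode
        if last_code == 0:
            return attempt
        # Wait before the next attempt, but not after the final failure.
        if attempt < max_attempts:
            time.sleep(wait_seconds)
    raise RuntimeError(
        f"{command!r} failed after {max_attempts} attempts (last exit code {last_code})"
    )
```

With this sketch, something like `run_with_retry(["tox", "--", "--ddtrace"], max_attempts=3, wait_seconds=10)` would rerun the whole tox step, including dependency installation, when the step exits non-zero.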

Acceptance criteria

Any time we use git during testing, there is retry logic

Suggested Tests

This task is specifically for tests

-- can force a test to fail in a commit & observe the retry works as expected at the integration group level

Impact to Other Teams

Adapters team won't be impacted but may be interested if we come up with a solution

Will backports be required?

backport as far as we can to reduce this noise

Context

log output from test failing on tox

Run tox -- --ddtrace
integration: install_deps> python -I -m pip install -r dev-requirements.txt -r editable-requirements.txt
  Running command git clone --filter=blob:none --quiet https://github.com/dbt-labs/dbt-adapters.git /tmp/pip-req-build-g9zkv3vu
  error: RPC failed; curl 16 Error in the HTTP2 framing layer
  fatal: expected 'packfile'
  fatal: could not fetch 22b2ad3f683cca452f28320c0aba8bb95933ca6e from promisor remote
Collecting git+https://github.com/dbt-labs/dbt-adapters.git@main (from -r dev-requirements.txt (line 1))
  Cloning https://github.com/dbt-labs/dbt-adapters.git (to revision main) to /tmp/pip-req-build-g9zkv3vu
integration: exit 1 (2.55 seconds) /home/runner/work/dbt-core/dbt-core> python -I -m pip install -r dev-requirements.txt -r editable-requirements.txt pid=1980
  warning: Clone succeeded, but checkout failed.
  You can inspect what was checked out with 'git status'
  and retry with 'git restore --source=HEAD :/'

  error: subprocess-exited-with-error
  
  × git clone --filter=blob:none --quiet https://github.com/dbt-labs/dbt-adapters.git /tmp/pip-req-build-g9zkv3vu did not run successfully.
  │ exit code: 128
  ╰─> See above for output.
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× git clone --filter=blob:none --quiet https://github.com/dbt-labs/dbt-adapters.git /tmp/pip-req-build-g9zkv3vu did not run successfully.
│ exit code: 128
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.
  integration: FAIL code 1 (5.44 seconds)
  evaluation failed :( (5.50 seconds)

Error: Process completed with exit code 1.

Sample of tests marked as flaky that are likely just connection issues. There may not be a solution when there's a longer GitHub outage. Look through #7808 for other possible failures.

  • #9906
  • max retries exceeded: #9905, #9903
  • timeout: #9902, #9900
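One way to separate connection issues from genuinely flaky tests is to match failure output against known network-error messages. The sketch below is an assumption-laden example: the patterns come only from the log and the failure categories quoted in this issue, the list is not exhaustive, and `looks_like_connection_issue` is a hypothetical helper name.

```python
import re

# Patterns taken from the log output and failure categories in this issue.
# Illustrative only; real failures may use other wording.
TRANSIENT_PATTERNS = [
    r"RPC failed",
    r"expected 'packfile'",
    r"could not fetch .* from promisor remote",
    r"max retries exceeded",
    r"timeout|timed out",
]

_TRANSIENT_RE = re.compile("|".join(TRANSIENT_PATTERNS), re.IGNORECASE)


def looks_like_connection_issue(log_text):
    """Return True if the log text contains any known transient-network message."""
    return _TRANSIENT_RE.search(log_text) is not None
```

A check like this could decide whether a failed step is worth retrying, or could help triage issues filed as flaky tests.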

Note: integration tests are run with the workflow_dispatch trigger in scheduled testing here. Typically it would be run with the workflow_call trigger, but it isn't because it's special (comment)

@emmyoop emmyoop added user docs [docs.getdbt.com] Needs better documentation tech_debt Behind-the-scenes changes, with little direct impact on end-user functionality labels Apr 12, 2024
@MichelleArk
Contributor

MichelleArk commented Apr 16, 2024

From refinement:

  • At what level should the retry logic live? Options: GH workflow (all we can really do if we fail at the tox step), using existing retry/fallback code
  • Could consider marking all gh-sensitive tests into a group and running them on their own test worker

@emmyoop
Member Author

emmyoop commented Apr 16, 2024

Hit this again on 1.3 and 1.4 today

@aranke
Member

aranke commented May 8, 2024

@emmyoop It looks like pip already retries network connections up to 5 times: https://pip.pypa.io/en/stable/cli/pip/#cmdoption-retries
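For reference, the retry count can be made explicit on the pip command line. The sketch below builds the same `pip install` invocation seen in the log, with pip's `--retries` and `--timeout` options spelled out at their documented defaults. `pip_install_command` is a hypothetical helper, and whether these options cover the `git clone` subprocess that failed in the log is not established in this thread.

```python
import sys


def pip_install_command(requirements_files, retries=5, timeout=15):
    """Build a pip install command with explicit retry and timeout options."""
    cmd = [
        sys.executable, "-I", "-m", "pip", "install",
        "--retries", str(retries),   # maximum retries per connection
        "--timeout", str(timeout),   # socket timeout in seconds
    ]
    for path in requirements_files:
        cmd += ["-r", path]
    return cmd
```

For example, `pip_install_command(["dev-requirements.txt", "editable-requirements.txt"])` yields a list that can be passed to `subprocess.run`.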

Given this information, I'm not sure if adding retries to our test runner (tox in this case) would improve the situation.

Similar issue in a GCP repo: GoogleCloudPlatform/python-docs-samples#3485 (comment)

Thoughts?

@FishtownBuildBot
Collaborator

Opened a new issue in dbt-labs/docs.getdbt.com: dbt-labs/docs.getdbt.com#5504

github-actions bot pushed a commit that referenced this issue May 14, 2024
aranke added a commit that referenced this issue May 20, 2024
aranke added a commit that referenced this issue May 21, 2024
…ts due to network failures (#10178)

* [Backport 1.0.latest] Fix #9907: Add retry to tox to reduce flaky tests due to network failures

* Update main.yml
aranke added a commit that referenced this issue May 21, 2024
…ts due to network failures (#10179)

* [Backport 1.1.latest] Fix #9907: Add retry to tox to reduce flaky tests due to network failures

* Update main.yml
aranke added a commit that referenced this issue May 21, 2024
aranke added a commit that referenced this issue May 21, 2024
…ts due to network failures (#10143)

(cherry picked from commit 751139d)

Co-authored-by: Kshitij Aranke <kshitij.aranke@dbtlabs.com>
aranke added a commit that referenced this issue May 21, 2024
…ts due to network failures (#10142)

(cherry picked from commit 751139d)

Co-authored-by: Kshitij Aranke <kshitij.aranke@dbtlabs.com>
@mirnawong1
Contributor

Hey @aranke, it looks like this opened a docs issue. Can I double-check what customer-facing changes are needed? From skimming this issue, it looks like this is more internal testing?
