Intermittent failures in CI E2E tests #9027
Comments
I have experienced this recently, so here's a data point:
The timeout seems to kick in around 15m into the run. It may or may not be significant, but the whole CI job that passed finished 30s under the 20m mark, whereas the run that failed was 3s over 20m.
@ezk84 Thanks for the data point!
Still hoping to address this when I have time if nobody else does, so keeping it alive.
As an update, the overall GitHub Action timeout for each e2e test was increased from 20m to 25m today, and the timeout passed into "go test" for the Test Suite run by the Action was increased from 15m to 20m (this PR). This should take care of some of the failures, although ultimately we need to address why the build is so slow. Also, it appears that individual tests can sometimes time out as well (like this one).
Here is one more test case:
Going to take a look at each of these test cases and see whether there is any common cause. TestParametrizableAds should have been addressed in 57bac33 with an increase in time for WaitForWorkflow().
Hmm, but it looks like that commit occurred on 7/11, while the test occurred on 7/13, so unfortunately I don't think that fixed it, right?
That looks correct, unfortunately. Will have to investigate this test case again.
I started a document (accessible by anyone at Intuit) that begins to go into some root causes.
Yeah, this is not stale. What is? Also: https://drewdevault.com/2021/10/26/stalebot.html
TestArtifactGC is apparently flaky. If somebody sees this, please include a link to the CI run.
This hasn't had any updates since December '22, so I think this may be OK to close. The more recent fixes, timeouts, retries, etc. may have helped with this. Tracking individual test failures as issues may be easier as well; see also the more recent #12832 and #12836. Also, if you do add a flaky test as an individual issue or otherwise, please copy+paste the test log like in Bala's comment above or this other comment of mine. GH Actions logs are only kept for a certain period of time, so a permanent error log by way of a comment is helpful for debugging and historical purposes.
The CI end-to-end tests often fail, but then pass after an empty commit is added. For each failure, we need to determine whether the issue is the test itself or an actual race condition in the code that behaves differently on each run.
We can add new occurrences over time here:
test-functional, minimal
test: TestSubmitWorkflowTemplateWithEnum
what happened: panic: test timed out after 15m0s
link: https://github.com/argoproj/argo-workflows/runs/7011970939?check_suite_focus=true
test: TestParametrizableAds
what happened: Error: "" does not contain "Pod was active on the node longer than the specified deadline"
link: https://github.com/argoproj/argo-workflows/runs/7332294820?check_suite_focus=true
test: AgentSuite/TestParallel
what happened: line 67: "Should be true"
link: https://github.com/argoproj/argo-workflows/runs/7698020976?check_suite_focus=true
link: https://github.com/argoproj/argo-workflows/runs/7736452278?check_suite_focus=true
test-cli, mysql
test: TestCLISuite/TestNodeSuspendResume
what happened: timeout after 1m at WaitForWorkflow()
link: https://github.com/argoproj/argo-workflows/runs/7365382562?check_suite_focus=true
test: TestCLISuite/TestWorkflowRetry
what happened: failure at: assert.True(t, retryTime.Before(&innerStepsPodNode.FinishedAt)), line 866
link: https://github.com/argoproj/argo-workflows/runs/7434857876?check_suite_focus=true
test-executor, minimal
test: N/A
what happened: no test ever got run; timed out after 24m in the "actions/cache@v3" step
link: https://github.com/argoproj/argo-workflows/runs/7435503575?check_suite_focus=true
official issue: "Action can wait for at least 27 minutes when no progress is being made on the download" (actions/cache#810)
test: SignalsSuite/TestStopBehavior
what happened: signals_test.go:34: timeout after 1m40s waiting for condition
link: https://github.com/argoproj/argo-workflows/runs/7459797029?check_suite_focus=true
test-examples, minimal
examples/arguments-parameters-from-configmap.yaml
error: timed out waiting for the condition on workflows/conditional-artifacts-svhsv
test-api, example
make wait
the action has timed out
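As a rough aid for the triage described above, the reported failure messages can be split into timeouts and assertion failures. The sketch below is only an illustration: `FailureKind`, `Classify`, and the keyword rule are hypothetical and not part of the Argo Workflows codebase, and a timeout label does not by itself tell you whether the test or the code is at fault.

```go
package main

import (
	"fmt"
	"strings"
)

// FailureKind is a hypothetical triage label, not part of the Argo codebase.
type FailureKind string

const (
	KindTimeout   FailureKind = "timeout"
	KindAssertion FailureKind = "assertion"
)

// Classify labels a message as a timeout if it mentions one,
// and as an assertion failure otherwise. The keyword rule is an assumption.
func Classify(msg string) FailureKind {
	m := strings.ToLower(msg)
	if strings.Contains(m, "timed out") || strings.Contains(m, "timeout") {
		return KindTimeout
	}
	return KindAssertion
}

func main() {
	// Test names and messages copied from the failures listed in this issue.
	failures := []struct{ test, msg string }{
		{"TestSubmitWorkflowTemplateWithEnum", "panic: test timed out after 15m0s"},
		{"TestParametrizableAds", `Error: "" does not contain "Pod was active on the node longer than the specified deadline"`},
		{"AgentSuite/TestParallel", `line 67: "Should be true"`},
		{"TestCLISuite/TestNodeSuspendResume", "timeout after 1m at WaitForWorkflow()"},
		{"TestCLISuite/TestWorkflowRetry", "failure at: assert.True(t, retryTime.Before(&innerStepsPodNode.FinishedAt)), line 866"},
		{"SignalsSuite/TestStopBehavior", "signals_test.go:34: timeout after 1m40s waiting for condition"},
	}

	counts := map[FailureKind]int{}
	for _, f := range failures {
		kind := Classify(f.msg)
		counts[kind]++
		fmt.Printf("%-40s %s\n", f.test, kind)
	}
	fmt.Println("timeouts:", counts[KindTimeout], "assertions:", counts[KindAssertion])
}
```

Grouping this way may help decide which failures to check against the suite-level and per-wait timeouts first, and which need a closer look at the assertion itself.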