
Intermittent failures in CI E2E tests #9027

Closed
juliev0 opened this issue Jun 22, 2022 · 22 comments
Labels: area/build, P2, type/bug, type/tech-debt

Comments

@juliev0
Contributor

juliev0 commented Jun 22, 2022

The CI end-to-end tests often fail, but then pass after an empty commit is added. We need to determine, for each failure, whether the issue is the test itself or an actual race condition in the code that behaves differently each time.

We can add new occurrences over time here:

@ezk84
Contributor

ezk84 commented Jun 24, 2022

I have experienced this recently, so here's a data point.

The timeout seems to trigger around 15m into the Run make test-functional E2E_TIMEOUT=1m STATIC_FILES=false step. When the job passes, that step seems to finish just slightly sooner (on the order of a 2-second difference).

It may or may not be significant, but the whole CI job that passed finished 30s under the 20m mark, whereas the run that failed went 3s over 20m.

@juliev0
Contributor Author

juliev0 commented Jun 24, 2022

@ezk84 Thanks for the data point!


@stale stale bot added the problem/stale label on Jul 10, 2022
@juliev0
Contributor Author

juliev0 commented Jul 10, 2022

Still hoping to address this when I have time if nobody else does, so keeping it alive.

@stale stale bot removed the problem/stale label on Jul 11, 2022
@juliev0
Contributor Author

juliev0 commented Jul 16, 2022

As an update, the overall GitHub Actions timeout for each e2e test job was increased from 20m to 25m today, and the timeout passed to "go test" for the test suite run by the Action was increased from 15m to 20m (this PR). This should take care of some of the failures, although ultimately we need to address why the build is so slow.

Also, it appears that individual tests can sometimes time out as well (like this one).
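
For context on how these timeouts nest: the GitHub Actions job timeout (now 25m) has to outlast the go test -timeout (now 20m), which in turn has to outlast any single in-test wait such as the E2E_TIMEOUT-driven workflow waits; whichever budget runs out first is the failure you see in CI. Below is a minimal sketch of that innermost layer, assuming a hypothetical waitFor helper and E2E_TIMEOUT parsing. It is not the actual argo-workflows test code (the comments below refer to a WaitForWorkflow() helper in the repo); it only illustrates the pattern.

// Hypothetical illustration only; not the repo's actual test fixture code.
package main

import (
	"context"
	"fmt"
	"os"
	"time"
)

// e2eTimeout reads an E2E_TIMEOUT-style duration from the environment,
// falling back to 1m (the value used in the CI invocation above).
func e2eTimeout() time.Duration {
	if v := os.Getenv("E2E_TIMEOUT"); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			return d
		}
	}
	return time.Minute
}

// waitFor polls check once per second until it returns true or the
// per-test deadline passes. Hitting the deadline here is the
// "individual test timed out" case; the suite-wide go test -timeout
// and the GitHub Actions job timeout sit above it.
func waitFor(check func() bool) error {
	timeout := e2eTimeout()
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	tick := time.NewTicker(time.Second)
	defer tick.Stop()
	for {
		select {
		case <-ctx.Done():
			return fmt.Errorf("condition not met within %v", timeout)
		case <-tick.C:
			if check() {
				return nil
			}
		}
	}
}

func main() {
	start := time.Now()
	// Simulate a workflow that takes ~3s to reach the desired condition.
	err := waitFor(func() bool { return time.Since(start) > 3*time.Second })
	fmt.Println("result:", err)
}

If a loaded CI runner pushes a legitimate workflow past that innermost deadline, the test flakes even though nothing is functionally broken, which is consistent with the pass-after-empty-commit behavior described in this issue.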

@sarabala1979
Member

sarabala1979 commented Aug 15, 2022

Here is one more test case:
https://github.com/argoproj/argo-workflows/runs/7816710269?check_suite_focus=true

FAIL: TestCLISuite/TestLogProblems (27.56s)
=== RUN   TestCLISuite/TestLogProblems
Submitting workflow  log-problems-
Waiting 1m0s for workflow metadata.name=log-problems-4d4nk
 ? log-problems-4d4nk Workflow 0s      

 ● log-problems-4d4nk   Workflow  0s      
 └ ● [0]                StepGroup 0s      
 └ ● log-problems-4d4nk Steps     0s      
 └ ◷ report-1           Pod       0s      

Condition "to start" met after 0s
../../dist/argo -n argo logs @latest --follow

@alexec alexec added the area/build label and removed the type/bug and triage labels on Sep 5, 2022
@dpadhiar
Member

dpadhiar commented Sep 7, 2022

Going to take a look at each of these test cases and see if there is a common cause.

TestParametrizableAds should have been addressed in 57bac33 with an increase in time for WaitForWorkflow()

@juliev0
Contributor Author

juliev0 commented Sep 8, 2022

TestParametrizableAds should have been addressed in 57bac33 with an increase in time for WaitForWorkflow()

Hmm, but it looks like that commit occurred on 7/11, while the test failure occurred on 7/13, so unfortunately I don't think that fixed it, right?

@dpadhiar
Member

dpadhiar commented Sep 8, 2022

TestParametrizableAds should have been addressed in 57bac33 with an increase in time for WaitForWorkflow()

Hmm, but it looks like that commit occurred on 7/11, while the test failure occurred on 7/13, so unfortunately I don't think that fixed it, right?

That looks correct, unfortunately. Will have to investigate this test case again.

@juliev0
Contributor Author

juliev0 commented Sep 30, 2022

I started a document (accessible to anyone at Intuit) that goes into some of the root causes.


@stale stale bot added the problem/stale label on Oct 29, 2022

@stale stale bot removed the problem/stale label on Oct 29, 2022

@stale stale bot added the problem/stale label on Nov 13, 2022

@stale stale bot removed the problem/stale label on Nov 14, 2022
@scravy
Contributor

scravy commented Nov 14, 2022

Yeah, this is not stale. What is? Also: https://drewdevault.com/2021/10/26/stalebot.html

@juliev0 juliev0 mentioned this issue Dec 14, 2022
@juliev0
Contributor Author

juliev0 commented Dec 14, 2022

TestArtifactGC is apparently flaky. If somebody sees this fail, please include a link to the CI run.


@stale stale bot added the problem/stale label on Dec 31, 2022

@stale stale bot removed the problem/stale label on Jan 1, 2023

@stale stale bot added the problem/stale label on Jan 21, 2023

@stale stale bot removed the problem/stale label on Jun 15, 2023
@agilgur5 agilgur5 added the type/bug and P2 labels on Sep 8, 2023
@agilgur5
Contributor

agilgur5 commented Mar 24, 2024

This hasn't had any updates since December '22, so I think it may be OK to close. The more recent fixes, timeouts, retries, etc. may have helped with this.

Tracking individual test failures as separate issues may be easier as well; see also the more recent #12832 and #12836.

Also, if you do report a flaky test as an individual issue or otherwise, please copy and paste the test log, as in Bala's comment above or this other comment of mine. GH Actions logs are only kept for a limited time, so a permanent error log in a comment is helpful for debugging and historical purposes.
