Flakey tests #10807
Comments
This is caused by #10768. Until kit is fixed or reverted, this will continue to happen. From your unit test logs:
https://github.com/argoproj/argo-workflows/actions/runs/4592799925/jobs/8110154390 It's not related to kit. Are you looking somewhere else?
I was looking at the failed e2e test, apologies.
Also flaky e2e test test-executor https://github.com/argoproj/argo-workflows/actions/runs/4608012677/jobs/8143253843:
Flaky test-cli https://github.com/argoproj/argo-workflows/actions/runs/4639608340/jobs/8210973276
test-cli: https://github.com/argoproj/argo-workflows/actions/runs/4663229967/jobs/8254446014
@GeunSam2 Would you like to take a look at this one? It seems pretty consistent. Another example: https://github.com/argoproj/argo-workflows/actions/runs/4670440612/jobs/8270253757
Okay, I'll check why the hooks test is failing.
Another one: #11384
Hooks tests are very flaky. Disabled them for now. Need to investigate potential bugs:
cc @toyamagu-2021 Would you like to help us debug these since you added these tests? (after you wrap up with the UI issues)
Also, this is technically a duplicate of #9027. They've got different flakes listed in each, but we could consolidate them into one issue.
I think we should really prevent these flaky tests from being merged in the first place. @terrytangyuan and @agilgur5, what are your opinions on running the test suite in parallel 10 (or so) times and only allowing merging when the tests pass on all runs? If we can launch the jobs in parallel, we shouldn't suffer any wait-time increases. We probably would need to pay for the extra compute, but I suspect it'd be cheaper than the person hours that go into dealing with flaky tests.
Ostensibly yes, but the average wait time would increase, since some jobs queue longer than others, some wait on the network longer, etc. This would probably put us over the limit of parallel jobs more frequently, causing more queueing as well.
I don't think this would actually help solve the problem. We're taking somewhat inaccurate flaky tests, usually caused by race conditions, and applying an even more inaccurate approach of "run all tests more times". Most PRs don't change the tests much, if at all, but they would fail more often with a change like this,
which would cause many hours of investigation or confusion due to existing flakes that were not caused by the new code. That's the current biggest issue, and this would increase it. I think we should be more precise in our approach. So if we wanted to take an approach like this, I would recommend one of:
I presume we are paying for more capacity here, but I can't see the time increasing by that much; sure, some pipelines will take a bit longer, but that's fine as long as the wait times are generally similar to what we have now.
I see where you are coming from; I kind of elided the fact that when we implement this, we should have no more flaky tests.
This is effectively what I am saying, I suppose, but to be more precise, my suggestion is: some kind of flaky-test detection would be nice to have as well.
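For illustration only, here is a minimal Go sketch (not existing code in this repo) of one shape flaky-test detection could take: re-run a test body several times in-process and fail if any attempt fails. The helper name and iteration count are hypothetical.

```go
package example

import (
	"fmt"
	"testing"
)

// detectFlakes re-runs a test body several times so intermittent failures
// (races, timing assumptions) surface before merge rather than afterwards.
// The helper name and iteration count are illustrative, not project code.
func detectFlakes(t *testing.T, iterations int, body func(t *testing.T)) {
	t.Helper()
	for i := 0; i < iterations; i++ {
		if ok := t.Run(fmt.Sprintf("attempt-%d", i), body); !ok {
			t.Fatalf("flaky: attempt %d failed", i)
		}
	}
}

func TestSomethingNew(t *testing.T) {
	detectFlakes(t, 10, func(t *testing.T) {
		// the actual assertions for the new test would go here
	})
}
```

In practice, `go test -race -count=10` on the affected packages gives much of the same signal for unit tests without any helper; the E2E suites would likely need a different mechanism.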
As far as I know, we're not currently paying anything and are on the free plan. There are concurrency limits that apply per plan (and I believe they apply to the entire GH org, not per repo). If we run 140+ (10 * (13 E2Es + 1 unit tests)) more jobs per run, we will almost certainly hit that limit, which will cause queueing, i.e. some parallel jobs will end up running sequentially, which will definitely increase wait times. It may also increase wait times across the rest of the org.
Tall ask -- will this ever be true? 😅
I wrote to only run the new tests themselves multiple times. But actually, rethinking this, neither of these would be correct; a source code change can cause a flake in an existing test, e.g. if a new unhandled race was introduced. That exact scenario has happened multiple times already. I'm still thinking a nightly or weekly job would make more sense than running on each PR.
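To make "a new unhandled race" concrete, here is a contrived Go test (not from this repository) that contains exactly that kind of bug:

```go
package example

import (
	"sync"
	"testing"
)

// TestCounterRace is a deliberately racy example: two goroutines increment a
// shared counter without synchronization. It may pass on many runs, but the
// final count can come up short depending on scheduling, making it flaky.
func TestCounterRace(t *testing.T) {
	counter := 0
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				counter++ // unsynchronized write: a data race
			}
		}()
	}
	wg.Wait()
	if counter != 2000 {
		t.Fatalf("expected 2000, got %d", counter)
	}
}
```

A test like this can pass or fail depending on goroutine scheduling, while `go test -race` reliably flags the underlying race, which is why running the race detector in CI catches this class of flake earlier.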
…er. Fixes argoproj#10807

While investigating the flaky `MetricsSuite/TestMetricsEndpoint` test that's been failing periodically for a while now, I noticed this in the controller logs ([example](https://github.com/argoproj/argo-workflows/actions/runs/11221357877/job/31191811077)):

```
controller: time="2024-10-07T18:22:14.793Z" level=info msg="Starting dummy metrics server at localhost:9090/metrics"
server: time="2024-10-07T18:22:14.793Z" level=info msg="Creating event controller" asyncDispatch=false operationQueueSize=16 workerCount=4
server: time="2024-10-07T18:22:14.800Z" level=info msg="GRPC Server Max Message Size, MaxGRPCMessageSize, is set" GRPC_MESSAGE_SIZE=104857600
server: time="2024-10-07T18:22:14.800Z" level=info msg="Argo Server started successfully on http://localhost:2746" url="http://localhost:2746"
controller: I1007 18:22:14.800947 25045 leaderelection.go:260] successfully acquired lease argo/workflow-controller
controller: time="2024-10-07T18:22:14.801Z" level=info msg="new leader" leader=local
controller: time="2024-10-07T18:22:14.801Z" level=info msg="Generating Self Signed TLS Certificates for Telemetry Servers"
controller: time="2024-10-07T18:22:14.802Z" level=info msg="Starting prometheus metrics server at localhost:9090/metrics"
controller: panic: listen tcp :9090: bind: address already in use
controller:
controller: goroutine 37 [running]:
controller: github.com/argoproj/argo-workflows/v3/util/telemetry.(*Metrics).RunPrometheusServer.func2()
controller: /home/runner/work/argo-workflows/argo-workflows/util/telemetry/exporter_prometheus.go:94 +0x16a
controller: created by github.com/argoproj/argo-workflows/v3/util/telemetry.(*Metrics).RunPrometheusServer in goroutine 36
controller: /home/runner/work/argo-workflows/argo-workflows/util/telemetry/exporter_prometheus.go:91 +0x53c
2024/10/07 18:22:14 controller: process exited 25045: exit status 2
controller: exit status 2
2024/10/07 18:22:14 controller: backing off 4s
```

I believe this is a race condition introduced in argoproj#11295. Here's the sequence of events that triggers this:

1. Controller starts
2. Dummy metrics server started on port 9090
3. Leader election takes place and controller starts leading
4. Context for dummy metrics server cancelled
5. Metrics server shuts down
6. Prometheus metrics server started on 9090

The problem is that steps 5-6 can happen out of order, because the shutdown happens after the context is cancelled. Per the docs, "a CancelFunc does not wait for the work to stop" (https://pkg.go.dev/context#CancelFunc). The controller needs to explicitly wait for the dummy metrics server to shut down properly before starting the Prometheus metrics server. There are many ways of doing that, and this uses a `WaitGroup`, as that's the simplest approach I could think of.

Signed-off-by: Mason Malone <651224+MasonM@users.noreply.github.com>
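A minimal sketch of the `WaitGroup` approach described above; the names (`runDummyMetricsServer`, the handlers, the leader-election placeholder) are illustrative, not the actual controller code:

```go
package main

import (
	"context"
	"log"
	"net/http"
	"sync"
)

// runDummyMetricsServer serves a placeholder endpoint until ctx is cancelled,
// then signals completion via wg so the caller knows the port is released.
func runDummyMetricsServer(ctx context.Context, addr string, wg *sync.WaitGroup) {
	defer wg.Done()
	srv := &http.Server{Addr: addr, Handler: http.NewServeMux()}
	go func() {
		<-ctx.Done()
		// Close the listener once the context is cancelled.
		_ = srv.Shutdown(context.Background())
	}()
	_ = srv.ListenAndServe() // returns once Shutdown has closed the listener
}

func main() {
	dummyCtx, cancel := context.WithCancel(context.Background())

	var wg sync.WaitGroup
	wg.Add(1)
	go runDummyMetricsServer(dummyCtx, ":9090", &wg)

	// ... leader election happens here; once this replica becomes leader ...

	cancel()  // request the dummy server to stop
	wg.Wait() // wait until it has actually released port 9090

	// Now it is safe to start the real metrics server on the same port.
	log.Println("starting prometheus metrics server")
	log.Fatal(http.ListenAndServe(":9090", http.NewServeMux()))
}
```

The key line is `wg.Wait()` after `cancel()`: cancelling the context only requests the shutdown, while the `WaitGroup` guarantees port 9090 is free before the real server binds to it.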
Partial fix for argoproj#10807. Builds are occasionally failing while pulling images from Docker Hub due to rate limiting, e.g. https://github.com/argoproj/argo-workflows/actions/runs/11564257560/job/32189242898:

> Oct 28 23:30:52 fv-az802-461 k3s[2185]: E1028 23:30:52.151698 2185 kuberuntime_image.go:53] "Failed to pull image" err="Error response from daemon: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit" image="minio/minio:RELEASE.2022-11-17T23-20-09Z"

As explained in the error, logging in increases the limit. We're already doing this in the "Release" workflow, so this copies that over.

Signed-off-by: Mason Malone <651224+MasonM@users.noreply.github.com>
I noticed E2E test builds are occasionally failing while pulling images from Docker Hub due to rate limiting. Example:
I opened #13830 to try and fix that by logging in using the same credentials used in the "Release" workflow, but that doesn't work due to security restrictions. It'd be theoretically possible to use docker-cache to cache these images, but that's a lot of complexity. Using self-hosted runners would definitely solve this, and as discussed in #13767 (comment), we're eligible for the CNCF self-hosted runners. However, it seems the ticket to add us is stalled.
It's best to have CI be as generic as possible; if the whole system relies on CNCF self-hosted runners, then forks won't be able to run CI at all. So workarounds may very well be worthwhile, even a generic backoff on rate limits.
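As one possible shape for such a workaround, here is a hedged Go sketch of retrying `docker pull` with exponential backoff on Docker Hub's `toomanyrequests` error; the helper name, attempt count, and delays are invented for illustration and assume the pulls go through the `docker` CLI:

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
	"time"
)

// pullWithBackoff retries `docker pull` with exponential backoff when Docker
// Hub returns its "toomanyrequests" rate-limit error. All names and retry
// parameters here are illustrative, not existing CI tooling.
func pullWithBackoff(image string, maxAttempts int) error {
	delay := 10 * time.Second
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		out, err := exec.Command("docker", "pull", image).CombinedOutput()
		if err == nil {
			return nil
		}
		if !strings.Contains(string(out), "toomanyrequests") {
			return fmt.Errorf("pull failed for a non-rate-limit reason: %s", out)
		}
		fmt.Printf("rate limited pulling %s (attempt %d/%d), retrying in %s\n",
			image, attempt, maxAttempts, delay)
		time.Sleep(delay)
		delay *= 2 // exponential backoff
	}
	return fmt.Errorf("still rate limited after %d attempts", maxAttempts)
}

func main() {
	if err := pullWithBackoff("minio/minio:RELEASE.2022-11-17T23-20-09Z", 5); err != nil {
		fmt.Println(err)
	}
}
```

The same pattern could wrap any image pull the E2E setup performs, and it keeps CI independent of any particular runner infrastructure.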
Pre-requisites
What happened/what you expected to happen?
Unit tests failed: https://github.com/argoproj/argo-workflows/actions/runs/4592799925/jobs/8110154390
Version
latest
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
See recent CI builds
Logs from the workflow controller
Logs from your workflow's wait container