argocd applicationcontroller stops working/syncing #11458
Comments
I encountered this same issue (2nd time). I thought the first time was because of resource starvation, so I implemented CPU requests at 2 cores. This did NOT help the issue; it just seems like the application controller dies at some point. To fix the issue I had to do a rolling restart of the statefulset and things were fine. We are only managing around 220 applications with this argo instance.
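For reference, the rolling-restart workaround above is what `kubectl rollout restart statefulset argocd-application-controller -n argocd` does. Below is a minimal client-go sketch of the same operation, assuming the default install names (`argocd` namespace, `argocd-application-controller` statefulset); adjust for your environment:

```go
// Sketch of the restart workaround described above, using client-go.
// Assumes default install names; not specific to any one Argo CD version.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	// Same mechanism as `kubectl rollout restart`: bump a pod-template
	// annotation so the statefulset controller replaces pods one by one.
	patch := fmt.Sprintf(
		`{"spec":{"template":{"metadata":{"annotations":{"kubectl.kubernetes.io/restartedAt":%q}}}}}`,
		time.Now().Format(time.RFC3339))
	_, err = cs.AppsV1().StatefulSets("argocd").Patch(
		context.Background(), "argocd-application-controller",
		types.StrategicMergePatchType, []byte(patch), metav1.PatchOptions{})
	if err != nil {
		log.Fatal(err)
	}
}
```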
Just experienced the same issue with around 120 applications; the application-controller stopped processing changes completely.
Similarly to others, I was able to manually restart the application controller and did not experience any more issues afterwards. Just before this started, I saw issues in the application-controller logs like this:
We have the same issue. ArgoCD running on OKD 4.11, argocd: v2.4.8+844f79e. Logs are the same as in the previous comments.
I have seen this issue "in the wild" as well. I suspect there's a deadlock in the status processor code. One question here to y'all folks: Do you manage multiple clusters with the Argo CD instances? Or a single one (e.g.
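For illustration, here is a minimal, hypothetical Go sketch of the kind of deadlock suspected above (this is not the actual status processor code): a producer holds a mutex while sending on a queue whose consumer needs the same mutex before receiving, so both goroutines park forever and the component silently goes idle.

```go
// Hypothetical illustration only, not Argo CD's real status processor:
// requeue() holds mu while sending on an unbuffered channel, and the
// worker must take mu before receiving. Both goroutines block forever.
// In a long-running process, only the affected goroutines stall, so the
// component goes quiet instead of crashing.
package main

import "sync"

var (
	mu    sync.Mutex
	queue = make(chan string) // unbuffered refresh queue
)

func requeue(app string) {
	mu.Lock()
	defer mu.Unlock()
	queue <- app // blocks while holding mu until a worker receives
}

func worker() {
	for {
		mu.Lock() // blocks forever: requeue() holds mu and is itself stuck
		app := <-queue
		_ = app // process the status update
		mu.Unlock()
	}
}

func main() {
	go worker()
	requeue("my-app") // in this toy program the runtime even panics:
	// "fatal error: all goroutines are asleep - deadlock!"
}
```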
@jannfis our argocd manages only one cluster
My set-up manages around 5 different clusters at the moment, with more incoming: both EKS-managed clusters and Kops clusters.
We are also affected. We are running ArgoCD 2.6.7 on a GKE cluster.
This issue was discussed in the last contributors' meeting and @jaideepr97 will bring it to the sig-scalability group.
Good, because we just tripped on it (v1.25.7-gke.1000 / argocd ~2.6.0) and I was going to start asking folks about it...
v2.6.1+3f143c9

Application Controller Disconnects From All Clusters It Managed

We manage multiple clusters. We often see the Application Controller disconnect from all clusters it was managing when one of those clusters is removed from Argo. We are working on reproducing this in a lab so we can provide more details. A simple restart of that Application Controller fixes this scenario.

Bad Argo Apps Slow Argo to a Crawl

Another scenario we want to reliably prove in a lab is a case where an Argo App gets into such a bad state (I think it can't talk to the cluster that was removed, and it has other major issues like a missing App Project) that it slows the entire Argo instance to a crawl (doesn't stop it, but slows it by 95%) for all Application Controllers, not just the one that was supposed to support the removed cluster. When I open up the Argo App, I see a flurry of red error boxes that go in and out of existence faster than they can be read. It is a lot. I wonder if this case has to do with the Argo Server instead of the App Controller, because it affects all Application Controllers equally. In this case, you can't restart your way out of it; you have to find the offending Argo App and remove it.
@alexmt can we cherry-pick the gitops-engine upgrade at least to 2.7 to fix this?
Reopening since I found an almost identical bug that also causes a deadlock. Creating a PR with the fix in a few minutes.
Experienced the same problem with OpenShift GitOps Operator v1.11.1 running ArgoCD version 2.9.5. We run into this error on a daily basis. After restarting the application controller, ArgoCD works fine for some time before it crashes again.
Same problem with ArgoCD version v2.10.5+335875d, can we please reopen this issue?
I'd rather someone file a new issue and link to this. It's a really bad strategy to constantly resurrect generic issues when there are distinct causes for each case. Otherwise, most software projects would only need a handful of issues:
Checklist:

I've pasted the output of `argocd version`.

Describe the bug
After running for some time, the application controller stops working: it just sits there, not watching/syncing objects. We can observe the number of goroutines within the controller dropping sharply (see below).
This behaviour has been observed a few times in the past; it is now happening more frequently with growing load and a growing number of applications, currently about 1 failure every 2 days with ~1k apps.
(Some applications have self-heal activated while containing errors, which creates high load, but that will need some more investigation.)
The last occurrence was accompanied by a slew of messages like:
retrywatcher.go:130] "Watch failed" err="Get \"https://yadda/apis/apps/v1/namespaces/NAMESPACE_HERE/RESOURCE_NAME_HERE/?allowWatchBookmarks=true&resourceVersion=5645476686&watch=true\": context canceled"
This seems to indicate that these watchers error out but are never restarted, which would explain the observed drop in goroutines.
Another argocd instance was not affected, so cluster API issues can be excluded.
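For illustration, a minimal client-go sketch (using pods as a stand-in resource) of the restart behaviour the description above implies should exist: an outer loop that re-establishes the watch whenever it terminates, instead of letting the goroutine exit silently. Note that the logged `context canceled` errors suggest the parent context itself was cancelled, in which case no retry loop inside the watcher can help.

```go
// Sketch of a self-restarting watch loop; not Argo CD's actual code.
package main

import (
	"context"
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// watchPods keeps a watch alive: if it errors or the server closes the
// stream, a fresh watch is established instead of the goroutine exiting.
func watchPods(ctx context.Context, cs *kubernetes.Clientset, ns string) {
	for {
		w, err := cs.CoreV1().Pods(ns).Watch(ctx,
			metav1.ListOptions{AllowWatchBookmarks: true})
		if err != nil {
			if ctx.Err() != nil {
				return // parent context cancelled: deliberate shutdown
			}
			log.Printf("watch failed, retrying: %v", err)
			time.Sleep(time.Second) // naive backoff for the sketch
			continue
		}
		for ev := range w.ResultChan() {
			log.Printf("event: %s", ev.Type)
		}
		// Result channel closed (timeout, stale resourceVersion, ...):
		// loop around and start a fresh watch.
	}
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	watchPods(context.Background(), cs, "default")
}
```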
To Reproduce
Difficult; it's a heisenbug. Run an argocd instance, have it watch ~1k partially broken apps with self-heal enabled, and wait 2 days.
Expected behavior
When watchers stop, they should be restarted, or the liveness probe should fail so that the controller can be restarted by the cluster.
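A minimal sketch of the liveness-probe half of that expectation, assuming a hypothetical `markProgress()` hook that the real watch loop would call on every event: the `/healthz` handler starts failing once the watchers stop making progress, so the kubelet restarts the pod.

```go
// Hypothetical sketch, not Argo CD's actual health check: fail the
// liveness endpoint when the watch loop has not reported progress
// for too long, so the kubelet restarts the container.
package main

import (
	"log"
	"net/http"
	"sync/atomic"
	"time"
)

var lastProgress int64 // unix seconds of the last observed watch event

// markProgress would be called from the real watch loop on every event.
func markProgress() { atomic.StoreInt64(&lastProgress, time.Now().Unix()) }

func main() {
	markProgress()
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		stale := time.Since(time.Unix(atomic.LoadInt64(&lastProgress), 0))
		if stale > 5*time.Minute { // threshold is an arbitrary example value
			http.Error(w, "watchers stalled", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	log.Fatal(http.ListenAndServe(":8082", nil))
}
```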
Screenshots
[graph of controller goroutine count dropping sharply, referenced above, not reproduced here]
Version
$ argocd version
argocd: v2.4.11+3d9e9f2
BuildDate: 2022-08-22T09:13:16Z
GitCommit: 3d9e9f2
GitTreeState: clean
GoVersion: go1.18.5
Compiler: gc
Platform: linux/amd64
FATA[0000] Argo CD server address unspecified
Logs
1000's of log lines like the retrywatcher.go "Watch failed ... context canceled" message quoted above.