
argocd applicationcontroller stops working/syncing #11458

Closed
andreasschramm opened this issue Nov 28, 2022 · 16 comments · Fixed by argoproj/gitops-engine#521 or #13636

Labels
bug (Something isn't working), type:scalability (Issues related to scalability and performance related issues)

Comments

@andreasschramm

Checklist:

  • I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • I've included steps to reproduce the bug.
  • I've pasted the output of argocd version.

Describe the bug

After running for some time, the application controller stops working: it just sits there, not watching or syncing objects. We can observe the number of goroutines within the controller dropping sharply (see below).
This behaviour has been observed occasionally in the past, but is now happening more frequently as load and the number of applications grow; currently about 1 failure every 2 days, with ~1k apps.
(Some applications have self-heal activated while containing errors, which creates high load, but that will need some more investigation.)

The last occurrence was accompanied by a slew of messages like:
retrywatcher.go:130] "Watch failed" err="Get \"https://yadda/apis/apps/v1/namespaces/NAMESPACE_HERE/RESOURCE_NAME_HERE/?allowWatchBookmarks=true&resourceVersion=5645476686&watch=true\": context canceled"

This seems to indicate that these watchers error out but are never restarted, which would explain the observed drop in goroutines.
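
For illustration (this is not the actual Argo CD code), the behaviour we would expect is a loop that re-establishes a watch whenever it dies for any reason other than the controller itself shutting down; the client, the Pod resource, and the fixed backoff in this sketch are placeholders:

```go
// Illustrative sketch only: a watch that is re-created when its result
// channel closes, instead of letting the goroutine exit silently.
// `client`, the Pod resource, and the fixed backoff are placeholders.
package watcher

import (
	"context"
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func watchForever(ctx context.Context, client kubernetes.Interface, namespace string) {
	for {
		w, err := client.CoreV1().Pods(namespace).Watch(ctx, metav1.ListOptions{AllowWatchBookmarks: true})
		if err != nil {
			if ctx.Err() != nil {
				return // outer context canceled: a legitimate shutdown
			}
			log.Printf("watch failed, retrying: %v", err)
			time.Sleep(time.Second) // real code would use exponential backoff
			continue
		}
		for ev := range w.ResultChan() { // channel closes when the watch dies
			_ = ev // hand the event to the cache / status processor here
		}
		if ctx.Err() != nil {
			return // watch ended because we are shutting down
		}
		log.Print("watch channel closed unexpectedly, re-establishing")
	}
}
```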

Our other Argo CD instance was not affected, so cluster API issues can be excluded.

To Reproduce
Difficult to reproduce reliably; it's a heisenbug.

Run an Argo CD instance, watch ~1k partially broken apps with self-heal enabled, and wait about 2 days.

Expected behavior

When watchers stop, they should be restarted, or the liveness probe should fail so that the controller can be restarted by the cluster.
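
As a purely hypothetical sketch of the second option (nothing like this exists in Argo CD today; the threshold and the endpoint wiring are made up), a watchdog could fail the liveness endpoint once the goroutine count collapses below a healthy baseline:

```go
// Hypothetical liveness watchdog, sketched only to illustrate the expected
// behaviour; the threshold and /healthz wiring are assumptions, not existing
// Argo CD configuration.
package main

import (
	"fmt"
	"net/http"
	"runtime"
)

const minGoroutines = 100 // placeholder baseline for a healthy controller

func healthz(w http.ResponseWriter, r *http.Request) {
	n := runtime.NumGoroutine()
	if n < minGoroutines { // a sharp drop like the one in the screenshot below
		http.Error(w, fmt.Sprintf("only %d goroutines, watches likely dead", n), http.StatusInternalServerError)
		return
	}
	fmt.Fprintf(w, "ok (%d goroutines)\n", n)
}

func main() {
	http.HandleFunc("/healthz", healthz)
	// the pod's livenessProbe would point at this port/path
	_ = http.ListenAndServe(":8082", nil)
}
```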

Screenshots

[screenshot: goroutine count within the controller dropping sharply]
Version


$ argocd version
argocd: v2.4.11+3d9e9f2
BuildDate: 2022-08-22T09:13:16Z
GitCommit: 3d9e9f2
GitTreeState: clean
GoVersion: go1.18.5
Compiler: gc
Platform: linux/amd64
FATA[0000] Argo CD server address unspecified

Logs
Thousands of lines like:

retrywatcher.go:130] "Watch failed" err="Get \"https://yadda/apis/apps/v1/namespaces/NAMESPACE_HERE/RESOURCE_NAME_HERE/?allowWatchBookmarks=true&resourceVersion=5645476686&watch=true\": context canceled"
andreasschramm added the 'bug' label on Nov 28, 2022
@jmmclean

I encountered this same issue (2nd time). I thought the first time was because of resource starvation, so I set CPU requests to 2 cores. This did NOT help; the application controller just seems to die at some point. To fix it I had to do a rolling restart of the StatefulSet, and things were fine.

We are only managing around 220 applications with this argo instance.

argocd-server: v2.5.4+86b2dde
  BuildDate: 2022-12-06T19:46:25Z
  GitCommit: 86b2dde8e4bf1187acd2b4294e94451cd104dad8
  GitTreeState: clean
  GoVersion: go1.18.8
  Compiler: gc
  Platform: linux/amd64
  Kustomize Version: v4.5.7 2022-08-02T16:35:54Z
  Helm Version: v3.10.1+g9f88ccb
  Kubectl Version: v0.24.2
  Jsonnet Version: v0.18.0


@d-wierdsma

Just experienced the same issue with around 120 applications; the application controller stopped processing changes completely.

argocd version
argocd: v2.6.1+3f143c9
  BuildDate: 2023-02-08T18:51:05Z
  GitCommit: 3f143c9307f99a61bf7049a2b1c7194699a7c21b
  GitTreeState: clean
  GoVersion: go1.18.10
  Compiler: gc
  Platform: linux/amd64

Similarly to others, I was able to manually restart the application controller and did not experience any more issues afterwards.


Just before this started for me I saw issues in the application-controller logs like this:

2023-03-15 13:33:28	W0315 17:33:28.173842       1 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
2023-03-15 13:33:27	W0315 17:33:27.868821       1 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
2023-03-15 13:33:27	W0315 17:33:27.866049       1 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
2023-03-15 13:32:58	{"level":"info","msg":"warning loading openapi schema: %s","server":"<EKS_CLUSTER_API_URL>","time":"2023-03-15T17:32:58Z"}
2023-03-15 13:32:56	{"level":"info","msg":"warning loading openapi schema: %s","server":"<EKS_CLUSTER_API_URL>","time":"2023-03-15T17:32:56Z"}
2023-03-15 13:32:54	I0315 17:32:54.880850       1 request.go:601] Waited for 1.015990162s due to client-side throttling, not priority and fairness, request: GET:<EKS_CLUSTER_API_URL>/apis/apigateway.aws.upbound.io/v1beta1?timeout=32s
2023-03-15 13:31:58	{"level":"info","msg":"warning loading openapi schema: %s","server":"<EKS_CLUSTER_API_URL>","time":"2023-03-15T17:31:58Z"}
2023-03-15 13:31:56	{"level":"info","msg":"warning loading openapi schema: %s","server":"<EKS_CLUSTER_API_URL>","time":"2023-03-15T17:31:56Z"}
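
As an aside, the "client-side throttling" line above comes from client-go's default rate limiter (QPS 5, burst 10), which discovery across many CRD API groups exhausts quickly. For illustration, a generic client-go sketch of raising those limits on the rest.Config; this is not an Argo CD-specific setting:

```go
// Generic client-go sketch (not Argo CD configuration): raising the default
// client-side rate limits (QPS 5, Burst 10) that produce the
// "Waited for ... due to client-side throttling" messages above.
package main

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cfg.QPS = 50    // default is 5
	cfg.Burst = 100 // default is 10
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	_ = client // requests made with this client are throttled far less often
}
```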

@AlexanderWurz

AlexanderWurz commented Apr 17, 2023

We have the same issue, with Argo CD running on OKD 4.11.

argocd: v2.4.8+844f79e
BuildDate: 2022-07-29T17:01:39Z
GitCommit: 844f79e
GitTreeState: clean
GoVersion: go1.18.4
Compiler: gc
Platform: linux/amd64

Logs are the same as in the previous comments.

@jannfis
Member

jannfis commented Apr 17, 2023

I have seen this issue "in the wild" as well. I suspect there's a deadlock in the status processor code.
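
To illustrate the class of bug I mean (this is not the actual status processor code, just the shape of the failure): a goroutine that blocks on a channel operation while holding a mutex that the other side also needs wedges every goroutine that touches that mutex, including whatever would restart dead watches.

```go
// Illustrative deadlock only, not the actual status processor code.
// Whichever side takes mu first, the other blocks on mu.Lock(), so the
// channel send and receive can never pair up; the Go runtime reports
// "all goroutines are asleep - deadlock!" when this is run.
package main

import "sync"

func main() {
	var mu sync.Mutex
	events := make(chan int) // unbuffered

	go func() { // producer, think: status processor
		mu.Lock()
		defer mu.Unlock()
		events <- 1 // blocks while holding mu
	}()

	mu.Lock() // consumer, think: refresh loop, also wants mu first
	<-events  // never pairs with the send above
	mu.Unlock()
}
```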

One question to y'all: do you manage multiple clusters with your Argo CD instances, or a single one (e.g. in-cluster only)?

@AlexanderWurz

@jannfis our Argo CD manages only one cluster.

@d-wierdsma

d-wierdsma commented Apr 17, 2023

My set-up manages around 5 different clusters at the moment, with more incoming.

Both EKS-managed clusters and Kops clusters.

jaideepr97 added the 'type:scalability' label on Apr 18, 2023
@robmonct

We are also affected. We are running Argo CD 2.6.7 on a GKE cluster.

@leoluz
Collaborator

leoluz commented Apr 27, 2023

This issue was discussed in the last contributors' meeting and @jaideepr97 will bring it to the sig-scalability group.

@jsoref
Member

jsoref commented Apr 27, 2023

Good, because we just tripped on it (v1.25.7-gke.1000 / ~argocd 2.6.0) and I was going to start asking folks about it...

@ericblackburn
Contributor

ericblackburn commented Apr 27, 2023

v2.6.1+3f143c9

Application Controller Disconnects From All Clusters It Managed

We manage multiple clusters. We often see the Application Controller disconnect from all clusters it manages when one of them is removed from Argo. We are working on reproducing this in a lab so we can provide more details.

A simple restart of that Application Controller fixes this scenario.

Bad Argo Apps Slow Argo to a crawl

Another scenario we want to reliably prove in a lab is a case where an Argo App gets into such a bad state (it can't talk to the cluster that was removed, and it has other major issues like a missing AppProject) that it slows the entire Argo instance to a crawl: it doesn't stop it, but slows it by ~95%, and for all Application Controllers, not just the one that was supposed to support the removed cluster. When I open that Argo App, I see a flurry of red error boxes that appear and disappear faster than they can be read. I wonder if this case has to do with the Argo Server instead of the App Controller, because it affects all Application Controllers equally.

In this case, you can't restart your way out of it; you have to find the offending Argo App and remove it.

@crenshaw-dev
Member

@alexmt can we cherry-pick the gitops-engine upgrade at least to 2.7 to fix this?

@alexmt
Collaborator

alexmt commented May 17, 2023

Reopening, since I found an almost identical bug that also causes a deadlock. Creating a PR with the fix in a few minutes.

@marcusnh

Experienced the same problem with Argo CD version 2.9.5.
Is there a fix to avoid this problem in the future?

@LeifSchumi

Experienced the same problem with OpenShift GitOps Operator v1.11.1 running Argo CD version 2.9.5. We run into this error on a daily basis. After restarting the application controller, Argo CD works fine for some time before it crashes again.

@ibbw

ibbw commented Apr 26, 2024

Same problem with Argo CD version v2.10.5+335875d; can we please reopen this issue?

@jsoref
Member

jsoref commented Apr 26, 2024

I'd rather someone file a new issue and link to this. It's a really bad strategy to constantly resurrect generic issues when there are distinct causes for each case.

Otherwise, most software projects would only need a handful of issues:

  • Program Crashes
  • Program is Slow
  • Program is Ugly
  • Program is Buggy
