
argocd applicationcontroller stops working/syncing #11458

Closed
andreasschramm opened this issue Nov 28, 2022 · 16 comments · Fixed by argoproj/gitops-engine#521 or #13636

Labels
bug (Something isn't working), type:scalability (Issues related to scalability and performance related issues)

Comments

@andreasschramm

Checklist:

  • I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • I've included steps to reproduce the bug.
  • I've pasted the output of argocd version.

Describe the bug

After running for some time, the application controller stops working: it just sits there, not watching or syncing objects. We can observe the number of goroutines within the controller dropping sharply (see below).
This behaviour has been observed occasionally in the past, but is now happening more frequently as load and the number of applications grow; currently about 1 failure every 2 days, with ~1k apps.
(Some applications have self-heal activated while containing errors, which creates high load, but that will need some more investigation.)

The last occurrence was accompanied by a slew of messages like:
retrywatcher.go:130] "Watch failed" err="Get \"https://yadda/apis/apps/v1/namespaces/NAMESPACE_HERE/RESOURCE_NAME_HERE/?allowWatchBookmarks=true&resourceVersion=5645476686&watch=true\": context canceled"

This seems to indicate that these watchers error out but are never restarted, which would explain the observed drop in goroutines.
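
For illustration (this is not the actual Argo CD code), the behaviour we would expect is a loop that re-establishes a watch whenever it dies for any reason other than the controller itself shutting down; the client, the Pod resource, and the fixed backoff in this sketch are placeholders:

```go
// Illustrative sketch only: a watch that is re-created when its result
// channel closes, instead of letting the goroutine exit silently.
// `client`, the Pod resource, and the fixed backoff are placeholders.
package watcher

import (
	"context"
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func watchForever(ctx context.Context, client kubernetes.Interface, namespace string) {
	for {
		w, err := client.CoreV1().Pods(namespace).Watch(ctx, metav1.ListOptions{AllowWatchBookmarks: true})
		if err != nil {
			if ctx.Err() != nil {
				return // outer context canceled: a legitimate shutdown
			}
			log.Printf("watch failed, retrying: %v", err)
			time.Sleep(time.Second) // real code would use exponential backoff
			continue
		}
		for ev := range w.ResultChan() { // channel closes when the watch dies
			_ = ev // hand the event to the cache / status processor here
		}
		if ctx.Err() != nil {
			return // watch ended because we are shutting down
		}
		log.Print("watch channel closed unexpectedly, re-establishing")
	}
}
```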

Our other Argo CD instance was not affected, so cluster API issues can be excluded.

To Reproduce
Difficult to reproduce reliably; it's a heisenbug.

Run an Argo CD instance, watch ~1k partially broken apps with self-heal enabled, and wait about 2 days.

Expected behavior

When watchers stop, they should be restarted, or the liveness probe should fail so that the controller can be restarted by the cluster.
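
As a purely hypothetical sketch of the second option (nothing like this exists in Argo CD today; the threshold and the endpoint wiring are made up), a watchdog could fail the liveness endpoint once the goroutine count collapses below a healthy baseline:

```go
// Hypothetical liveness watchdog, sketched only to illustrate the expected
// behaviour; the threshold and /healthz wiring are assumptions, not existing
// Argo CD configuration.
package main

import (
	"fmt"
	"net/http"
	"runtime"
)

const minGoroutines = 100 // placeholder baseline for a healthy controller

func healthz(w http.ResponseWriter, r *http.Request) {
	n := runtime.NumGoroutine()
	if n < minGoroutines { // a sharp drop like the one in the screenshot below
		http.Error(w, fmt.Sprintf("only %d goroutines, watches likely dead", n), http.StatusInternalServerError)
		return
	}
	fmt.Fprintf(w, "ok (%d goroutines)\n", n)
}

func main() {
	http.HandleFunc("/healthz", healthz)
	// the pod's livenessProbe would point at this port/path
	_ = http.ListenAndServe(":8082", nil)
}
```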

Screenshots

[screenshot: goroutine count within the controller dropping sharply]
Version


$ argocd version
argocd: v2.4.11+3d9e9f2
BuildDate: 2022-08-22T09:13:16Z
GitCommit: 3d9e9f2
GitTreeState: clean
GoVersion: go1.18.5
Compiler: gc
Platform: linux/amd64
FATA[0000] Argo CD server address unspecified

Logs
Thousands of lines like:

retrywatcher.go:130] "Watch failed" err="Get \"https://yadda/apis/apps/v1/namespaces/NAMESPACE_HERE/RESOURCE_NAME_HERE/?allowWatchBookmarks=true&resourceVersion=5645476686&watch=true\": context canceled"
andreasschramm added the 'bug' label on Nov 28, 2022
@jmmclean

I encountered this same issue (2nd time). I thought the first time was because of resource starvation, so I set CPU requests to 2 cores. This did NOT help; the application controller just seems to die at some point. To fix it I had to do a rolling restart of the StatefulSet, and things were fine.

We are only managing around 220 applications with this argo instance.

argocd-server: v2.5.4+86b2dde
  BuildDate: 2022-12-06T19:46:25Z
  GitCommit: 86b2dde8e4bf1187acd2b4294e94451cd104dad8
  GitTreeState: clean
  GoVersion: go1.18.8
  Compiler: gc
  Platform: linux/amd64
  Kustomize Version: v4.5.7 2022-08-02T16:35:54Z
  Helm Version: v3.10.1+g9f88ccb
  Kubectl Version: v0.24.2
  Jsonnet Version: v0.18.0


@d-wierdsma

Just experienced the same issue with around 120 applications; the application controller stopped processing changes completely.

argocd version
argocd: v2.6.1+3f143c9
  BuildDate: 2023-02-08T18:51:05Z
  GitCommit: 3f143c9307f99a61bf7049a2b1c7194699a7c21b
  GitTreeState: clean
  GoVersion: go1.18.10
  Compiler: gc
  Platform: linux/amd64

Similarly to others, I was able to manually restart the application controller and did not experience any more issues afterwards.


Just before this started for me I saw issues in the application-controller logs like this:

2023-03-15 13:33:28	W0315 17:33:28.173842       1 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
2023-03-15 13:33:27	W0315 17:33:27.868821       1 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
2023-03-15 13:33:27	W0315 17:33:27.866049       1 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
2023-03-15 13:32:58	{"level":"info","msg":"warning loading openapi schema: %s","server":"<EKS_CLUSTER_API_URL>","time":"2023-03-15T17:32:58Z"}
2023-03-15 13:32:56	{"level":"info","msg":"warning loading openapi schema: %s","server":"<EKS_CLUSTER_API_URL>","time":"2023-03-15T17:32:56Z"}
2023-03-15 13:32:54	I0315 17:32:54.880850       1 request.go:601] Waited for 1.015990162s due to client-side throttling, not priority and fairness, request: GET:<EKS_CLUSTER_API_URL>/apis/apigateway.aws.upbound.io/v1beta1?timeout=32s
2023-03-15 13:31:58	{"level":"info","msg":"warning loading openapi schema: %s","server":"<EKS_CLUSTER_API_URL>","time":"2023-03-15T17:31:58Z"}
2023-03-15 13:31:56	{"level":"info","msg":"warning loading openapi schema: %s","server":"<EKS_CLUSTER_API_URL>","time":"2023-03-15T17:31:56Z"}
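
As an aside, the "client-side throttling" line above comes from client-go's default rate limiter (QPS 5, burst 10), which discovery across many CRD API groups exhausts quickly. For illustration, a generic client-go sketch of raising those limits on the rest.Config; this is not an Argo CD-specific setting:

```go
// Generic client-go sketch (not Argo CD configuration): raising the default
// client-side rate limits (QPS 5, Burst 10) that produce the
// "Waited for ... due to client-side throttling" messages above.
package main

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cfg.QPS = 50    // default is 5
	cfg.Burst = 100 // default is 10
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	_ = client // requests made with this client are throttled far less often
}
```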

@AlexanderWurz

AlexanderWurz commented Apr 17, 2023

We have the same issue, with Argo CD running on OKD 4.11.

argocd: v2.4.8+844f79e
BuildDate: 2022-07-29T17:01:39Z
GitCommit: 844f79e
GitTreeState: clean
GoVersion: go1.18.4
Compiler: gc
Platform: linux/amd64

Logs are the same as in the previous comments.

@jannfis
Member

jannfis commented Apr 17, 2023

I have seen this issue "in the wild" as well. I suspect there's a deadlock in the status processor code.
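
To illustrate the class of bug I mean (this is not the actual status processor code, just the shape of the failure): a goroutine that blocks on a channel operation while holding a mutex that the other side also needs wedges every goroutine that touches that mutex, including whatever would restart dead watches.

```go
// Illustrative deadlock only, not the actual status processor code.
// Whichever side takes mu first, the other blocks on mu.Lock(), so the
// channel send and receive can never pair up; the Go runtime reports
// "all goroutines are asleep - deadlock!" when this is run.
package main

import "sync"

func main() {
	var mu sync.Mutex
	events := make(chan int) // unbuffered

	go func() { // producer, think: status processor
		mu.Lock()
		defer mu.Unlock()
		events <- 1 // blocks while holding mu
	}()

	mu.Lock() // consumer, think: refresh loop, also wants mu first
	<-events  // never pairs with the send above
	mu.Unlock()
}
```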

One question to y'all: do you manage multiple clusters with your Argo CD instances, or a single one (e.g. in-cluster only)?

@AlexanderWurz

@jannfis our Argo CD manages only one cluster.

@d-wierdsma

d-wierdsma commented Apr 17, 2023

My set-up manages around 5 different clusters at the moment, with more incoming.

Both EKS-managed clusters and Kops clusters.

jaideepr97 added the 'type:scalability' label on Apr 18, 2023
@robmonct

We are also affected. We are running Argo CD 2.6.7 on a GKE cluster.

@leoluz
Collaborator

leoluz commented Apr 27, 2023

This issue was discussed in the last contributors' meeting and @jaideepr97 will bring it to the sig-scalability group.

@jsoref
Member

jsoref commented Apr 27, 2023

Good, because we just tripped on it (v1.25.7-gke.1000 / ~argocd 2.6.0) and I was going to start asking folks about it...

@ericblackburn
Contributor

ericblackburn commented Apr 27, 2023

v2.6.1+3f143c9

Application Controller Disconnects From All Clusters It Managed

We manage multiple clusters. We often see the Application Controller disconnect from all clusters it manages when one of them is removed from Argo. We are working on reproducing this in a lab so we can provide more details.

A simple restart of that Application Controller fixes this scenario.

Bad Argo Apps Slow Argo to a crawl

Another scenario we want to reliably prove in a lab is a case where an Argo App gets into such a bad state (it can't talk to the cluster that was removed, and it has other major issues like a missing AppProject) that it slows the entire Argo instance to a crawl: it doesn't stop it, but slows it by ~95%, and for all Application Controllers, not just the one that was supposed to support the removed cluster. When I open that Argo App, I see a flurry of red error boxes that appear and disappear faster than they can be read. I wonder if this case has to do with the Argo Server instead of the App Controller, because it affects all Application Controllers equally.

In this case, you can't restart your way out of it; you have to find the offending Argo App and remove it.

@crenshaw-dev
Member

@alexmt can we cherry-pick the gitops-engine upgrade at least to 2.7 to fix this?

@alexmt
Collaborator

alexmt commented May 17, 2023

Reopening, since I found an almost identical bug that also causes a deadlock. Creating a PR with the fix in a few minutes.

@marcusnh

Experienced the same problem with Argo CD version 2.9.5.
Is there a fix to avoid this problem in the future?

@LeifSchumi

Experienced the same problem with OpenShift GitOps Operator v1.11.1 running Argo CD version 2.9.5. We run into this error on a daily basis. After restarting the application controller, Argo CD works fine for some time before it crashes again.

@ibbw

ibbw commented Apr 26, 2024

Same problem with Argo CD version v2.10.5+335875d; can we please reopen this issue?

@jsoref
Member

jsoref commented Apr 26, 2024

I'd rather someone file a new issue and link to this. It's a really bad strategy to constantly resurrect generic issues when there are distinct causes for each case.

Otherwise, most software projects would only need a handful of issues:

  • Program Crashes
  • Program is Slow
  • Program is Ugly
  • Program is Buggy
