application-controller constant high CPU use with little activity #6108
Comments
Thank you for providing logs @jessebye. Argo CD has different reconciliation "levels": argo-cd/controller/appcontroller.go, lines 68 to 75 in d9bc6cf.
According to the logs, the application controller periodically stores the changed child-resource tree to Redis. That is supposed to be a very lightweight operation, though. How big are your applications? E.g. how many resources are part of each application, including child resources (pods, endpoints, etc.)?
Our two biggest applications have 80-100 resources each. Most of our other apps range from 10-20 resources each.
@alexmt why is it doing the level 0 refresh so frequently? I count 513 refreshes of that type in the 40s span of logs above. That's for 69 apps, so it's refreshing every app approximately every 5 seconds.
It does a level 0 refresh every time any child resource changes (e.g. when a new pod gets restarted). When that happens the controller just stores an updated snapshot to Redis so that the UI can visualize the updated resource state. The process is not supposed to consume a lot of CPU. I just checked the CPU usage of our internal Argo CD: it manages ~2300 apps and performs ~25k reconciliations per 5 mins. Instances consume ~3300 millicores, which seems reasonable. So a high number of reconciliations is not the issue by itself. Probably the high CPU usage is due to some other reason. Trying to figure out how to debug it.
@alexmt anything I can do to help on this? Is there a debug flag or some way to capture more debugging info that could help you? Another side effect we've noticed is that the nodes that run the argocd-application-controller start to exhaust their network connections (conntrack table). This happens frequently, and only on the node where the app controller is running. It seems likely that it's related to the high CPU use we're seeing.
@alexmt sorry for the noise, but we are still experiencing this issue and would like to do what we can to resolve it. Please let me know what information we can provide to help move this forward? 🙏
I can also confirm we're having the same issue. From what I can gather, anytime a resource status changes, a reconciliation occurs. We use Kubernetes External Secrets and KEDA, which both poll and update resource status fields frequently. I have ArgoCD configured to ignore all status fields, but there are enough events flowing through the controller to cause considerable CPU burn, even though the controller appears to opt out of reconciliation. Is it possible to rate limit the number of events that are queued in the controller?
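For reference, a minimal sketch of one way such an "ignore status fields" setting is commonly expressed in argocd-cm (this is an assumption about the configuration being described here, and the key names/support vary by Argo CD version):

```yaml
# argocd-cm (sketch): ignore the /status subtree when diffing all resources.
# Note: this only affects diffing; the underlying watch events still reach the
# controller, which matches the CPU behavior described above.
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
  labels:
    app.kubernetes.io/part-of: argocd
data:
  resource.customizations.ignoreDifferences.all: |
    jsonPointers:
      - /status
```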
We are likewise seeing the same behavior: steady-state ~5500m CPU utilization with only 173 applications across 8 clusters and an average of ~25k reconciliations/5min. Both KEDA and External Secrets are deployed in each of our target workload clusters.
Hmm, we use KEDA as well. No External Secrets though. I wonder if there is something about the way these operators work that causes this problem in Argo? @Aenima4six2 @lowkeyliesmyth have either of you tried the Argo CD 2.1 release candidate? I noticed the blog post mentioned reductions in memory use, faster syncs, and fewer git requests. Maybe one of those improvements will also help with this issue? 🤞
@jessebye I also have the same problem; I've upgraded ArgoCD to
@craqs @Aenima4six2 that's good to know. I guess we won't prioritize updating to 2.1 if this issue persists there. @craqs do you use KEDA and/or External Secrets too? @alexmt we would really like to help get this fixed. We get frequent alerts about our conntrack tables filling up, and Argo CD is almost always the culprit. It seems to create a ton of network connections, in addition to thrashing the CPU. Even on a small cluster with relatively few apps this happens. Can we help collect some debug logs or metrics to help pinpoint the issue?
Hello @jessebye, sorry - had to prioritize stabilizing v2.1 so we can release accumulated improvements. Please bear with me. I suspect the root cause is some "noisy" resource that is constantly getting updated and triggers frequent reconciliation. The long-term solution is to introduce throttling so the controller can handle it gracefully. Short term we can try to find the resource and use e.g. It is not easy to troubleshoot it remotely. Do you have time to sync up next week in Slack so we can troubleshoot it together?
@jessebye No, I don't use that.
@alexmt Do you have any suggestion on how to check it? Recently I briefly tracked
That describes the behavior of both Keda and ExternalSecrets quite well.
This kubernetes event tracing project would help visualize resource update events by turning them into OTel spans. I don't have any tracing infrastructure set up to get this running, but if someone else has it ready to go this might provide some insight.
How are you working around the
Hey. I was having the exact same problem and managed to find the resource that was causing trouble. After digging around I found that all apps in the cert-manager namespace were triggering ~60 refreshes per minute. I also found this issue with the ALB ingress controller in kube-system
@joaosilva15 Thanks for the advice! We had the same problem, and it was resolved by ignoring a couple of orphaned ConfigMaps, so CPU usage dropped drastically (6000+ to around 300-500 millicores on the argocd-application-controller pod).
Br, Alexey
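For anyone wanting to do the same, a hedged sketch of what ignoring specific orphaned ConfigMaps in an AppProject can look like (the project and ConfigMap names below are hypothetical placeholders, not the ones from this cluster):

```yaml
# AppProject (sketch): keep orphaned-resource monitoring on, but stop a few
# constantly-updated ConfigMaps (e.g. leader-election locks) from triggering work.
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: default
  namespace: argocd
spec:
  orphanedResources:
    warn: true
    ignore:
      - kind: ConfigMap
        name: cert-manager-controller     # hypothetical noisy ConfigMap
      - kind: ConfigMap
        name: ingress-controller-leader   # hypothetical noisy ConfigMap
```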
I uninstalled cert-manager from our cluster and the CPU usage has definitely dropped.
Discovered that the high CPU is caused by an expensive method that is used only when orphaned resources are enabled: argo-cd/controller/appcontroller.go, line 248 in a408e29
The @jessebye, @Aenima4six2, @agrevtcev could you try v2.2.0-rc1 please?
@alexmt I just updated one of our clusters from 2.1.0 to 2.2.0-rc1. The CPU usage has not gone down significantly. I also tested disabling orphaned resource management, and found that a HUGE volume of the level 0 refreshes went away. However, the CPU use did not decrease by a noticeable amount.
Thank you @jessebye! I think orphaned resources are expected to cause a lot of level 0 refreshes but should not be expensive. Do you use cluster sharding? There is a possible bug that causes unnecessary CPU use when the controller is sharded.
@alexmt we don't use sharding. All of our applications have
v2.3.0 still has high CPU usage
v2.3.3 still has high CPU usage - 1.5 cores at idle
#8100 (comment)
Yes @manlme, fully disable orphaned resources monitoring, and as @Vladyslav-Miletskyi said in #8100 (comment), don't forget to restart your application controller pod!
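In case it helps others landing here, a sketch of what fully disabling orphaned-resource monitoring looks like at the AppProject level (project name and fields assumed from a default install):

```yaml
# AppProject (sketch): omitting spec.orphanedResources entirely turns
# orphaned-resource monitoring off for this project's applications.
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: default
  namespace: argocd
spec:
  sourceRepos:
    - '*'
  destinations:
    - namespace: '*'
      server: '*'
  # orphanedResources: {}   # deliberately omitted/removed
```

After applying it, restart the controller as mentioned above, e.g. kubectl -n argocd rollout restart statefulset argocd-application-controller (assuming the default StatefulSet name).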
Got the high CPU issue too. Does anyone have a recent solution?
The behavior can be configured in
In my case the culprit was this bug in the Kyverno admission controller.
@amorozkin you might want to exclude this resource with https://argo-cd.readthedocs.io/en/stable/operator-manual/declarative-setup/#resource-exclusioninclusion
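A hedged sketch of that exclusion approach in argocd-cm (the Kyverno report kinds below are only examples; substitute whatever resource is actually churning in your cluster):

```yaml
# argocd-cm (sketch): stop Argo CD from watching/caching the noisy kinds at all.
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
  labels:
    app.kubernetes.io/part-of: argocd
data:
  resource.exclusions: |
    - apiGroups:
        - kyverno.io
      kinds:
        - AdmissionReport        # example noisy kinds; adjust as needed
        - BackgroundScanReport
      clusters:
        - "*"
```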
Checklist:
I've pasted the output of argocd version.
Describe the bug
The application-controller uses about 3,000 millicores of CPU at idle with ~69 apps. Also, refresh/sync of apps seems very slow. Sometimes the UI displays "Refreshing" for several minutes before the sync actually starts.

To Reproduce
The application controller is running with --app-resync 1800.
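For context, a sketch of how that flag is typically wired into the controller (StatefulSet and container names assumed from a default install, not taken from this report):

```yaml
# argocd-application-controller (sketch): excerpt of the container command.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
  namespace: argocd
spec:
  template:
    spec:
      containers:
        - name: argocd-application-controller
          command:
            - argocd-application-controller
            - --app-resync
            - "1800"   # full app resync every 30 minutes
```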
Expected behavior
The application-controller uses less CPU when there is no activity, and even under times of load it should not need 3 cores for just 69 apps.

Screenshots
n/a
Version
Logs
These are 40 seconds worth of logs from application-controller. Searching for metrics-server, for example, reveals it appears 103 times in the span of those 40 seconds.