High CPU usage on application-controller -> Infinite level 1 refresh loop? #13614

rob-howie-depop · 2023-05-16T14:56:01Z

Checklist:

I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
I've included steps to reproduce the bug.
I've pasted the output of argocd version.

Describe the bug

I have ArgoCD deployed via the community chart. This is a single-cluster deployment with 99 applications. These apps are pretty typical Deployment / Ingress / CM / VPA / Service / SA, nothing wild.

When I do a kubectl top, I can see that my argocd-application-controller pod is using ~5 vCPU. I belive that it would be more, but k8s is throttling application-controller CPU

~$ kubectl top po -n argo
NAME                                               CPU(cores)   MEMORY(bytes)   
argo-rollouts-849568885b-58hnm                     2m           26Mi            
argo-rollouts-849568885b-cwqc5                     4m           122Mi           
argo-rollouts-dashboard-774597968c-8rlrq           1m           12Mi            
argocd-application-controller-0                    5121m        443Mi           
argocd-dex-server-757d86854d-s6kg9                 1m           36Mi            
argocd-notifications-controller-68dc675bd9-plfz7   2m           29Mi            
argocd-redis-ha-haproxy-64f4899859-59zgd           40m          81Mi            
argocd-redis-ha-haproxy-64f4899859-85mbk           61m          80Mi            
argocd-redis-ha-haproxy-64f4899859-ckkt4           15m          80Mi            
argocd-redis-ha-server-0                           176m         327Mi           
argocd-redis-ha-server-1                           107m         309Mi           
argocd-redis-ha-server-2                           122m         308Mi           
argocd-repo-server-5744877dbf-b68l7                106m         78Mi            
argocd-repo-server-5744877dbf-blvm5                65m          73Mi            
argocd-server-5d46757c5f-h2hv2                     2m           57Mi            
argocd-server-5d46757c5f-xhljz                     7m           83Mi

I actually have two clusters, and both of these have the same issue.
When I look at the logs I see a massive number of level 1 syncs' occuring, tens' per second, which seems wrong

I have redacted my application names, but its a series of different apps, and the logs here continue ad infinitum (it doesn't stop, I haven't caught it mid-sync)

~$ kubectl logs -n argo argocd-application-controller-0 --all-containers -f | grep "level (1)"
time="2023-05-16T14:11:34Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:34Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:34Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:34Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:34Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:34Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...

To Reproduce

I am using https://artifacthub.io/packages/helm/argo/argo-cd at version: 5.31.1

I'm not currently set many variables differently from the defaults other than setting up HA for the redis cluster, and SSO for the ArgoCD UI.

ArgoCD is pointing at a giant monorepo that gets updated every couple of mins with application manifests generated by Jenkins.

Expected behavior

I was expecting ~100-200mCPU usage for the argocd-application-controller replicaset pod.

Version

here is my argocd version:
v2.7.1+5e54351.dirty

** Help needed **

I have already checked several related issues:

Issue 6108 is around L0 syncs, which I have checked & are not an issue for my case.

However, this is what lead me to look at the L1 syncs, which I believe to be the root cause. I don't understand fully what triggers L1 Syncs and need a signpost in the right direction.

Per comments in the code, a L1 sync is: Compare live application state against state defined using revision of most recent comparison.

I got this far in the code, but don't have a hollistic understanding of whats going on here

argo-cd/controller/appcontroller.go

Lines 236 to 273 in d9bc6cf

    
           func (ctrl *ApplicationController) handleObjectUpdated(managedByApp map[string]bool, ref v1.ObjectReference) { 
        
           	// if namespaced resource is not managed by any app it might be orphaned resource of some other apps 
        
           	if len(managedByApp) == 0 && ref.Namespace != "" { 
        
           		// retrieve applications which monitor orphaned resources in the same namespace and refresh them unless resource is denied in app project 
        
           		if objs, err := ctrl.appInformer.GetIndexer().ByIndex(orphanedIndex, ref.Namespace); err == nil { 
        
           			for i := range objs { 
        
           				app, ok := objs[i].(*appv1.Application) 
        
           				if !ok { 
        
           					continue 
        
           				} 
        
           				// exclude resource unless it is permitted in the app project. If project is not permitted then it is not controlled by the user and there is no point showing the warning. 
        
           				if proj, err := ctrl.getAppProj(app); err == nil && proj.IsGroupKindPermitted(ref.GroupVersionKind().GroupKind(), true) && 
        
           					!isKnownOrphanedResourceExclusion(kube.NewResourceKey(ref.GroupVersionKind().Group, ref.GroupVersionKind().Kind, ref.Namespace, ref.Name), proj) { 
        
           					managedByApp[app.Name] = false 
        
           				} 
        
           			} 
        
           		} 
        
           	} 
        
           	for appName, isManagedResource := range managedByApp { 
        
           		obj, exists, err := ctrl.appInformer.GetIndexer().GetByKey(ctrl.namespace + "/" + appName) 
        
           		if app, ok := obj.(*appv1.Application); exists && err == nil && ok && isSelfReferencedApp(app, ref) { 
        
           			// Don't force refresh app if related resource is application itself. This prevents infinite reconciliation loop. 
        
           			continue 
        
           		} 
        
           		if !ctrl.canProcessApp(obj) { 
        
           			// Don't force refresh app if app belongs to a different controller shard 
        
           			continue 
        
           		} 
        
           		level := ComparisonWithNothing 
        
           		if isManagedResource { 
        
           			level = CompareWithRecent 
        
           		} 
        
           		ctrl.requestAppRefresh(appName, &level, nil) 
        
           	} 
        
           }

The text was updated successfully, but these errors were encountered:

aaemula · 2023-05-31T08:37:55Z

We encountered the same behavior as well.

The appplication-controller keeps reconcile the level 1 app from the refreshRequestedApps queue.
Looks like the app controller is design to monitor the managed app periodically.

Could we tune the application controller to not monitor the managed app so frequently?
Such as honor the reconcile period configuration.

tramperdk · 2023-06-05T13:34:02Z

We currently have this issue increasingly - the more apps we create the higher the uptime on refreshes ergo the CPU load becomes higher as well.

I want to say we saw this happening when we introduced KEDA to our deployment strategy.

Our deployment has 4 ScaledObjects with a polling interval of 5 which manages 4 HPA.
I see the 8 of them "blink" now and then, but relatively frequently (I'd say 8 "blinks" for every 10 seconds across all 8)

We have tried to exclude HPA and ScaledObject from Diffing via argocd-cm (I've omitted company data)

apiVersion: v1
data:
  resource.compareoptions: |
    ignoreResourceStatusField: all
  resource.customizations.ignoreDifferences.autoscaling_HorizontalPodAutoscaler: |
    jsonPointers:
    - /metadata
    - /status
  resource.customizations.ignoreDifferences.keda.sh_ScaledObject: |
    jsonPointers:
    - /metadata
    - /status

I can't share any logs since customer names appear often, but I can tell you we have about twice as many (controller refresh requested), level (0) than (controller refresh requested), level (1)

We have about 450 lines of logs for 10 milliseconds all being either level 0 or 1 refreshes getRepoObjs stats or Comparing app state (cluster: https://kubernetes.default.svc, namespace: )

Currently, we're on 2700 CPU with 30 apps.

agaudreault · 2023-08-07T18:51:59Z

The behavior can be configured in ignoreResourceUpdates to resolve this issue.

rob-howie-depop added the bug Something isn't working label May 16, 2023

agaudreault mentioned this issue Jun 7, 2023

feat: add ignoreResourceUpdates to reduce controller CPU usage (#13534) #13912

Merged

13 tasks

agaudreault closed this as completed Aug 7, 2023

tebaly mentioned this issue Sep 3, 2024

Issue with spec.syncPolicy.automated.selfHeal Option Ignoring Timeouts(probably) #19289

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

High CPU usage on application-controller -> Infinite level 1 refresh loop? #13614

High CPU usage on application-controller -> Infinite level 1 refresh loop? #13614

rob-howie-depop commented May 16, 2023

aaemula commented May 31, 2023 •

edited

Loading

tramperdk commented Jun 5, 2023 •

edited

Loading

agaudreault commented Aug 7, 2023

High CPU usage on application-controller -> Infinite level 1 refresh loop? #13614

High CPU usage on application-controller -> Infinite level 1 refresh loop? #13614

Comments

rob-howie-depop commented May 16, 2023

aaemula commented May 31, 2023 • edited Loading

tramperdk commented Jun 5, 2023 • edited Loading

agaudreault commented Aug 7, 2023

aaemula commented May 31, 2023 •

edited

Loading

tramperdk commented Jun 5, 2023 •

edited

Loading