Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

High CPU usage on application-controller -> Infinite level 1 refresh loop? #13614

Closed
3 tasks done
rob-howie-depop opened this issue May 16, 2023 · 3 comments
Closed
3 tasks done
Labels
bug Something isn't working

Comments

@rob-howie-depop
Copy link

Checklist:

  • I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • I've included steps to reproduce the bug.
  • I've pasted the output of argocd version.

Describe the bug

I have ArgoCD deployed via the community chart. This is a single-cluster deployment with 99 applications. These apps are pretty typical Deployment / Ingress / CM / VPA / Service / SA, nothing wild.

When I do a kubectl top, I can see that my argocd-application-controller pod is using ~5 vCPU. I belive that it would be more, but k8s is throttling application-controller CPU

~$ kubectl top po -n argo
NAME                                               CPU(cores)   MEMORY(bytes)   
argo-rollouts-849568885b-58hnm                     2m           26Mi            
argo-rollouts-849568885b-cwqc5                     4m           122Mi           
argo-rollouts-dashboard-774597968c-8rlrq           1m           12Mi            
argocd-application-controller-0                    5121m        443Mi           
argocd-dex-server-757d86854d-s6kg9                 1m           36Mi            
argocd-notifications-controller-68dc675bd9-plfz7   2m           29Mi            
argocd-redis-ha-haproxy-64f4899859-59zgd           40m          81Mi            
argocd-redis-ha-haproxy-64f4899859-85mbk           61m          80Mi            
argocd-redis-ha-haproxy-64f4899859-ckkt4           15m          80Mi            
argocd-redis-ha-server-0                           176m         327Mi           
argocd-redis-ha-server-1                           107m         309Mi           
argocd-redis-ha-server-2                           122m         308Mi           
argocd-repo-server-5744877dbf-b68l7                106m         78Mi            
argocd-repo-server-5744877dbf-blvm5                65m          73Mi            
argocd-server-5d46757c5f-h2hv2                     2m           57Mi            
argocd-server-5d46757c5f-xhljz                     7m           83Mi 

I actually have two clusters, and both of these have the same issue.
When I look at the logs I see a massive number of level 1 syncs' occuring, tens' per second, which seems wrong

I have redacted my application names, but its a series of different apps, and the logs here continue ad infinitum (it doesn't stop, I haven't caught it mid-sync)

~$ kubectl logs -n argo argocd-application-controller-0 --all-containers -f | grep "level (1)"
time="2023-05-16T14:11:34Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:34Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:34Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:34Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:34Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:34Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:35Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...
time="2023-05-16T14:11:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argo/...

To Reproduce

I am using https://artifacthub.io/packages/helm/argo/argo-cd at version: 5.31.1

I'm not currently set many variables differently from the defaults other than setting up HA for the redis cluster, and SSO for the ArgoCD UI.

ArgoCD is pointing at a giant monorepo that gets updated every couple of mins with application manifests generated by Jenkins.

Expected behavior

I was expecting ~100-200mCPU usage for the argocd-application-controller replicaset pod.

Version

here is my argocd version:
v2.7.1+5e54351.dirty

** Help needed **

I have already checked several related issues:

Issue 6108 is around L0 syncs, which I have checked & are not an issue for my case.

However, this is what lead me to look at the L1 syncs, which I believe to be the root cause. I don't understand fully what triggers L1 Syncs and need a signpost in the right direction.

Per comments in the code, a L1 sync is: Compare live application state against state defined using revision of most recent comparison.

I got this far in the code, but don't have a hollistic understanding of whats going on here

func (ctrl *ApplicationController) handleObjectUpdated(managedByApp map[string]bool, ref v1.ObjectReference) {
// if namespaced resource is not managed by any app it might be orphaned resource of some other apps
if len(managedByApp) == 0 && ref.Namespace != "" {
// retrieve applications which monitor orphaned resources in the same namespace and refresh them unless resource is denied in app project
if objs, err := ctrl.appInformer.GetIndexer().ByIndex(orphanedIndex, ref.Namespace); err == nil {
for i := range objs {
app, ok := objs[i].(*appv1.Application)
if !ok {
continue
}
// exclude resource unless it is permitted in the app project. If project is not permitted then it is not controlled by the user and there is no point showing the warning.
if proj, err := ctrl.getAppProj(app); err == nil && proj.IsGroupKindPermitted(ref.GroupVersionKind().GroupKind(), true) &&
!isKnownOrphanedResourceExclusion(kube.NewResourceKey(ref.GroupVersionKind().Group, ref.GroupVersionKind().Kind, ref.Namespace, ref.Name), proj) {
managedByApp[app.Name] = false
}
}
}
}
for appName, isManagedResource := range managedByApp {
obj, exists, err := ctrl.appInformer.GetIndexer().GetByKey(ctrl.namespace + "/" + appName)
if app, ok := obj.(*appv1.Application); exists && err == nil && ok && isSelfReferencedApp(app, ref) {
// Don't force refresh app if related resource is application itself. This prevents infinite reconciliation loop.
continue
}
if !ctrl.canProcessApp(obj) {
// Don't force refresh app if app belongs to a different controller shard
continue
}
level := ComparisonWithNothing
if isManagedResource {
level = CompareWithRecent
}
ctrl.requestAppRefresh(appName, &level, nil)
}
}

@rob-howie-depop rob-howie-depop added the bug Something isn't working label May 16, 2023
@aaemula
Copy link

aaemula commented May 31, 2023

We encountered the same behavior as well.

The appplication-controller keeps reconcile the level 1 app from the refreshRequestedApps queue.
Looks like the app controller is design to monitor the managed app periodically.

Could we tune the application controller to not monitor the managed app so frequently?
Such as honor the reconcile period configuration.

@tramperdk
Copy link

tramperdk commented Jun 5, 2023

We currently have this issue increasingly - the more apps we create the higher the uptime on refreshes ergo the CPU load becomes higher as well.

I want to say we saw this happening when we introduced KEDA to our deployment strategy.

Our deployment has 4 ScaledObjects with a polling interval of 5 which manages 4 HPA.
I see the 8 of them "blink" now and then, but relatively frequently (I'd say 8 "blinks" for every 10 seconds across all 8)

We have tried to exclude HPA and ScaledObject from Diffing via argocd-cm (I've omitted company data)

apiVersion: v1
data:
  resource.compareoptions: |
    ignoreResourceStatusField: all
  resource.customizations.ignoreDifferences.autoscaling_HorizontalPodAutoscaler: |
    jsonPointers:
    - /metadata
    - /status
  resource.customizations.ignoreDifferences.keda.sh_ScaledObject: |
    jsonPointers:
    - /metadata
    - /status

I can't share any logs since customer names appear often, but I can tell you we have about twice as many (controller refresh requested), level (0) than (controller refresh requested), level (1)

We have about 450 lines of logs for 10 milliseconds all being either level 0 or 1 refreshes getRepoObjs stats or Comparing app state (cluster: https://kubernetes.default.svc, namespace: )

Currently, we're on 2700 CPU with 30 apps.

@agaudreault
Copy link
Member

The behavior can be configured in ignoreResourceUpdates to resolve this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants