
Sync loop for Helm Applications that are using post-delete hooks #17117

Open
3 tasks done
ZF-fredericvanlinthoudt opened this issue Feb 7, 2024 · 15 comments
Labels
bug/in-triage This issue needs further triage to be correctly classified bug Something isn't working component:application-sets Bulk application management related component:core Syncing, diffing, cluster state cache type:bug version:2.11 Latest confirmed affected version is 2.11

Comments

@ZF-fredericvanlinthoudt

Checklist:

  • I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • I've included steps to reproduce the bug.
  • I've pasted the output of argocd version.

Describe the bug

Since we updated to ArgoCD v2.10.0, we have been facing a constant refresh/sync loop for Applications that use a Helm template as source and define "post-delete" hooks in Helm.
This is probably related to the new feature that added support for post-delete hooks.
The application diff (see screenshot below) shows that it wants to remove two post-delete-finalizer.argocd.argoproj.io finalizers from the Application.
This change gets synced, but almost instantaneously the Application is out of sync again with the same diff, and the process repeats over and over.
On our production ArgoCD instance, with more than 1200 applications, this causes ArgoCD to freeze and stop syncing any other applications (those other applications' syncs are just stuck in "waiting to start").
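For illustration, the diff amounts to the live Application carrying two finalizers that the desired state from git lacks. A sketch (Application name is hypothetical; finalizer names are the ones shown in the diff):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: example-app   # hypothetical
  namespace: argocd
  finalizers:
    # Present on the live object since post-delete hook support landed in
    # v2.10; absent from git, so auto-sync keeps proposing their removal.
    - post-delete-finalizer.argocd.argoproj.io
    - post-delete-finalizer.argocd.argoproj.io/cleanup
```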

To Reproduce

https://REDACTED.git is a placeholder for a Git repository that contains directories with Applications

  • Have an ApplicationSet that generates Applications from underlying directories in a Git repository with auto-sync enabled.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  labels:
    argocd.argoproj.io/instance: appset-prd
    zf.argocd.ground-services: 'true'
    zf.argocd.group: optional-gloo-release
    zf.argocd.kind: application-set
    zf.argocd.stage: prd
  name: prd-optional-gloo-release
  namespace: argocd
spec:
  generators:
    - matrix:
        generators:
          - clusters:
              selector:
                matchLabels:
                  argocd.argoproj.io/secret-type: cluster
                  zf-gloo: 'true'
                  zf-kind: global
                  zf-stage: prd
          - git:
              directories:
                - path: charts/optional-releases/gloo-release/*
              repoURL: >-
                https://REDACTED.git
              revision: release/prd
  template:
    metadata:
      annotations:
        argocd.argoproj.io/manifest-generate-paths: .
      labels:
        zf.argocd.ground-services: 'true'
        zf.argocd.group: optional-releases
        zf.argocd.kind: app-of-application-set
        zf.argocd.stage: prd
      name: 'app-{{ name }}-{{ path.basename }}'
    spec:
      destination:
        namespace: argocd
        server: 'https://kubernetes.default.svc'
      project: 'project-ground-services-apps-{{ name }}'
      source:
        helm:
          parameters:
            - name: clusterName
              value: '{{ name }}'
            - name: destinationServer
              value: '{{ server }}'
            - name: branch
              value: release/prd
            - name: stage
              value: prd
        path: '{{ path }}'
        repoURL: >-
          https://REDACTED.git
        targetRevision: release/prd
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
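For completeness, the git directory generator above would match a repository layout roughly like this (directory and file names are hypothetical):

```
charts/optional-releases/gloo-release/
├── app-one/          # becomes Application 'app-<cluster>-app-one'
│   ├── Chart.yaml
│   └── values.yaml
└── app-two/
    ├── Chart.yaml
    └── values.yaml
```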

Expected behavior

Applications that use post-delete Helm hooks should sync successfully in one go and should not be re-synced over and over when auto-sync is enabled.

Screenshots

image

Version

argocd: v2.10.0+2175939.dirty
BuildDate: 2024-02-06T15:31:31Z
GitCommit: 2175939ed6156ddd743e60f427f7f48118c971bf
GitTreeState: dirty
GoVersion: go1.21.6
Compiler: gc
Platform: linux/amd64
argocd-server: v2.10.0+2175939

Logs

No relevant logs found.

@ZF-fredericvanlinthoudt ZF-fredericvanlinthoudt added the bug Something isn't working label Feb 7, 2024
@pohldk

pohldk commented Feb 7, 2024

We also experienced this, and since we have Argo CD installed via Helm we had fun trying to roll back 😅

@tcpecheanu

On our production ArgoCD instance, with 1000+ applications, the sync and refresh buttons completely froze the UI after updating to v2.10.0. We noticed that the application controller used twice as much memory and CPU, but we also didn't find any relevant logs.
We had to roll back to v2.9.5.

@jgwest jgwest added the component:core Syncing, diffing, cluster state cache label Feb 10, 2024
@AnubhavSabharwa

Sharding is not working in 2.10.0 the way it did in previous versions.
If you remove the ARGOCD_CONTROLLER_REPLICAS environment variable and restart the controller, you will see sync and refresh start working again.

@Skoucail

Skoucail commented Apr 4, 2024

We experience the same sync loop issue with version 2.10.5.
image

Has anyone found a solution for this?
Is it an option to add the two finalizers to the Application in git? Or would that break an initial deploy?
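If anyone wants to experiment with that idea, a minimal sketch of pre-declaring the finalizers in git (untested; Application name and spec are hypothetical, and it is unclear whether this interferes with normal deletion):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: example-app        # hypothetical
  namespace: argocd
  finalizers:
    # Pre-declared so the finalizers added live for post-delete hook
    # support no longer show up as a diff (names from the observed diff):
    - post-delete-finalizer.argocd.argoproj.io
    - post-delete-finalizer.argocd.argoproj.io/cleanup
spec: {}                   # hypothetical; real source/destination/project omitted
```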

@joebowbeer
Contributor

Fixed by #18003 ?

@ricardojdsilva87

ricardojdsilva87 commented May 20, 2024

Hello,
We also started seeing several Applications in ArgoCD constantly going out of sync with those two finalizers as the diff.
This started happening after upgrading from v2.9.6 to v2.11.0.
After reverting to v2.9.6, everything went back to normal.
After the upgrade to v2.11.0, we saw every metric go up: memory usage, CPU usage, and the queue times, which had been zero. The upgrade occurred around 9 AM today, May 20th.
image
image
image

After installing v2.9.6, everything went back to normal again. Please ignore the gap between ~17:35 and ~18:00; we had an issue with metrics collection.
image
image
image
image

It can be clearly seen that there is a spike in every application controller metric (CPU, RAM, Kubernetes executions) and a drop after reverting to v2.9.6.
We saw an immediate increase in the queue time, which returns to zero after reverting the version.

At the moment we only have these metrics for v2.9.6 and v2.11.0. For some reason our metrics agent is unable to gather any information for other versions; I will check what can be done and test other versions to see if this issue with the finalizers persists.

Thanks!

UPDATE
Hello,
Just to add more information, regarding the issue.
It seems that v2.9.15 behaves like v2.9.6, while trying v2.10.10 caused the issues mentioned above, so it must be something introduced in v2.10.x. As soon as that version is installed, we start seeing the queue increase and the apps enter a sync loop.
Thanks for the support

@mmalyska
Contributor

mmalyska commented Jun 4, 2024

I'm on v2.11.2+25f7504 and experiencing the same problems.
I'm stuck in infinite sync loops when selfHeal is on.
image

@antonio-tolentino

I've installed the version below and I am facing the same issue:
{
  "Version": "v2.11.3+3f344d5",
  "BuildDate": "2024-06-06T08:42:00Z",
  "GitCommit": "3f344d54a4e0bbbb4313e1c19cfe1e544b162598",
  "GitTreeState": "clean",
  "GoVersion": "go1.21.9",
  "Compiler": "gc",
  "Platform": "linux/amd64",
  "KustomizeVersion": "v5.2.1 2023-10-19T20:13:51Z",
  "HelmVersion": "v3.14.4+g81c902a",
  "KubectlVersion": "v0.26.11",
  "JsonnetVersion": "v0.20.0"
}

argocd_issue

@alexmt alexmt added bug/in-triage This issue needs further triage to be correctly classified component:application-sets Bulk application management related type:bug labels Jun 26, 2024
@didlawowo

Got the same with the NVIDIA GPU operator; disabling self-heal doesn't change anything.

@ricardojdsilva87

The same is still happening in the latest version, v2.12.4:
image

@gadiener

We are also experiencing this. Is there a workaround for it?

@igorivan

We're experiencing the same issue with the Falcon sensor, as mentioned in the previous comment. Could you please advise?

@wikka

wikka commented Oct 16, 2024

Got also the same issue. Any tips on how to circumvent it?

@lorenzboguhn

Hey, I found a possible mitigation in issue #17433; this ticket is probably a duplicate of that one.
TL;DR: add the following to the argocd-cm ConfigMap to ignore these differences in ArgoCD Applications (from a comment in that issue):

resource.customizations.ignoreDifferences.argoproj.io_Application: |
  jqPathExpressions:
    - .metadata.finalizers[]? | select(. == "post-delete-finalizer.argocd.argoproj.io" or . == "post-delete-finalizer.argocd.argoproj.io/cleanup")
    - if (.metadata.finalizers | length) == 0 then .metadata.finalizers else empty end
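For context, a minimal sketch of how that key sits inside the full argocd-cm ConfigMap (metadata values here are the standard install defaults; adjust to your setup):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
  labels:
    app.kubernetes.io/part-of: argocd
data:
  # Ignore the post-delete hook finalizers when diffing Applications.
  resource.customizations.ignoreDifferences.argoproj.io_Application: |
    jqPathExpressions:
      - .metadata.finalizers[]? | select(. == "post-delete-finalizer.argocd.argoproj.io" or . == "post-delete-finalizer.argocd.argoproj.io/cleanup")
      - if (.metadata.finalizers | length) == 0 then .metadata.finalizers else empty end
```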

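To see what the two jq expressions above accomplish, here is a small Python emulation (illustrative only, not Argo CD code): the post-delete finalizers are dropped before diffing, and an empty finalizer list is treated the same as a missing one, so live and desired states compare equal.

```python
# Illustrative emulation of the jq ignoreDifferences expressions:
# drop the post-delete hook finalizers, and normalize an empty list
# to "no finalizers" so it no longer produces a diff.
POST_DELETE_FINALIZERS = {
    "post-delete-finalizer.argocd.argoproj.io",
    "post-delete-finalizer.argocd.argoproj.io/cleanup",
}

def normalized_finalizers(metadata):
    """Return the finalizers that still matter for diffing, or None."""
    kept = [f for f in metadata.get("finalizers", [])
            if f not in POST_DELETE_FINALIZERS]
    return kept or None  # empty list -> ignored entirely

live = {"finalizers": ["post-delete-finalizer.argocd.argoproj.io",
                       "post-delete-finalizer.argocd.argoproj.io/cleanup"]}
desired = {}  # the manifest in git declares no finalizers

# After normalization both sides agree, so there is no diff to sync.
print(normalized_finalizers(live) == normalized_finalizers(desired))  # True
```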
@ricardojdsilva87

Hello, indeed the mentioned snippet stops the post-delete finalizers from being reported as a diff.
After enabling this setting, the resource usage of the controller is no longer as high as mentioned before.
image
But the queue still increases:
image
We are using the ArgoCD Datadog integration, so these metrics are reported directly by the ArgoCD pods. Metrics that increased a lot and might be related are these:
image
They seem to be related to the repo server now. Could this also be related to the queue increasing?
This might be another issue not related to the post-delete hook, but it only started happening after upgrading to a release > 2.10.x.
That release line added the server-side diff feature, but as far as I know it is disabled by default and has to be enabled with controller.diff.server.side (see the documentation).
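For reference, a minimal sketch of how server-side diff is enabled globally, via the argocd-cmd-params-cm ConfigMap as described in the docs (it is off by default; shown here only for orientation, not as a recommendation for this issue):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  # Enables server-side diff for the application controller.
  controller.diff.server.side: "true"
```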

I'll post here if I can find anything else new

@andrii-korotkov-verkada andrii-korotkov-verkada added the version:2.11 Latest confirmed affected version is 2.11 label Nov 11, 2024