
Argo CD considers PreSync phase finished even though the Job was just created #10077

Open
2 of 3 tasks
nazarewk opened this issue Jul 22, 2022 · 13 comments
Labels
bug, hooks

Comments

@nazarewk
Contributor

nazarewk commented Jul 22, 2022

Checklist:

  • I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • I've included steps to reproduce the bug.
  • I've pasted the output of argocd version.

Describe the bug

I've noticed a weird behavior blocking our deployments every now and then (a couple of times a day), where Argo CD considers a PreSync hook finished even though the Job was just created. The Job's Pod hangs waiting for a Secret (created through an ExternalSecret) that was deleted because PreSync had supposedly finished.

I've asked about it on Slack at https://cloud-native.slack.com/archives/C01TSERG0KZ/p1658409808237339, but didn't find the cause.

Example timeline

  1. 10:36:54 Argo CD created the ExternalSecret, ESO created a Secret
  2. 10:37:02 Argo CD deleted the old Job
  3. 10:37:04 Argo CD created a new Job
  4. 10:37:07 Argo CD deleted the ExternalSecret
    1. the Secret was garbage collected, then ESO recreated it for a split second
  5. 14:12:05 Argo CD cleaned up the Job (I might have terminated the Sync manually)

Extra info:

  • the Pod (and Job) got stuck indefinitely, waiting for the Secret
  • I'm on Argo CD 2.3.3
  • the ExternalSecret is created in wave -2, the Job is at wave 20 (see the sketch below)
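
For context, a minimal sketch of how the two hook resources relate, assuming (as the timeline suggests) that the ExternalSecret is itself a PreSync hook. It is expressed with Argo CD's native annotations; the real manifests use the equivalent Helm hook annotations (helm.sh/hook, helm.sh/hook-weight), which Argo CD maps onto the same phase and waves. Names and the ExternalSecret apiVersion are placeholders:

apiVersion: external-secrets.io/v1beta1     # placeholder; depends on the ESO release
kind: ExternalSecret
metadata:
  name: fa-app-hooks                        # placeholder name
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/sync-wave: "-2"      # created first, early in PreSync
---
apiVersion: batch/v1
kind: Job
metadata:
  name: fa-app-migrate-kt
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/sync-wave: "20"      # runs much later in the same phase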

Seems more or less related to the issues linked in the comments below.

To Reproduce

No idea; it happens randomly, in the ballpark of once every 10 deployments.

Expected behavior

Argo CD does not consider the PreSync phase finished until the Job has started and completed (with either success or failure).

Version
image: quay.io/argoproj/argocd:v2.3.3; not sure why argocd version reports 2.4.4 (that's the CLI's version).

Logs

Google Sheet gathering the following information:

  • Kubernetes APIServer logs:
    • all logs related to the application
    • filtered down to creations & modifications
    • filtered down to create/delete
    • extracted suspicious part of logs
  • example ExternalSecret manifest
  • Application Controller logs
Job manifest
apiVersion: batch/v1
kind: Job
metadata:
  name: fa-app-migrate-kt
  labels:
    app.kubernetes.io/instance: fa-app
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/version: "git-22eREDACTED0d5"
    helm.sh/chart: app-REDACTED
    tags.datadoghq.com/env: "pr-prnum-ns"
    tags.datadoghq.com/fa-app-migrate-kt.env: "pr-prnum-ns"
    tags.datadoghq.com/service: "fa-app-migrate-kt"
    tags.datadoghq.com/version: "git-22eREDACTED0d5"
    tags.datadoghq.com/fa-app-migrate-kt.service: "fa-app-migrate-kt"
    tags.datadoghq.com/fa-app-migrate-kt.version: "git-22eREDACTED0d5"
    image-tag: "git-22eREDACTED0d5"
    job-type: init
  annotations:
    helm.sh/hook: pre-install,pre-upgrade
    helm.sh/hook-delete-policy: before-hook-creation
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
    helm.sh/hook-weight: "20"
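    # note: Argo CD maps helm.sh/hook pre-install/pre-upgrade to its PreSync phase
    # and helm.sh/hook-weight to the sync wave, so this Job runs as a PreSync hook
    # in wave 20 with a BeforeHookCreation delete policy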
spec:
  backoffLimit: 5
  template:
    metadata:
      labels:
        job-type: init
        app.kubernetes.io/instance: fa-app
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/version: "git-22eREDACTED0d5"
        helm.sh/chart: app-REDACTED
        
        tags.datadoghq.com/env: "pr-prnum-ns"
        tags.datadoghq.com/fa-app-migrate-kt.env: "pr-prnum-ns"
        tags.datadoghq.com/service: "fa-app-migrate-kt"
        tags.datadoghq.com/version: "git-22eREDACTED0d5"
        tags.datadoghq.com/fa-app-migrate-kt.service: "fa-app-migrate-kt"
        tags.datadoghq.com/fa-app-migrate-kt.version: "git-22eREDACTED0d5"
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
        ad.datadoghq.com/fa-app-migrate-kt-copy.logs: "[{\n  \"source\": \"km-jobs\"\n}]\n"
        ad.datadoghq.com/fa-app-migrate-ktm.logs: "[{\n  \"source\": \"km-jobs\"\n}]\n"
    spec:
      restartPolicy: Never
      initContainers:
        - name: fa-app-migrate-kt-copy
          image: REDACTED/fa-app:git-22eREDACTED0d5
          command: [REDACTED]
          env:
            - name: APP_NAME
              value: fa-app-migrate-kt
          envFrom:
            - configMapRef:
                name: fa-app-hooks
            - secretRef:
                name: fa-app-hooks
          securityContext:
            allowPrivilegeEscalation: false
          volumeMounts:
            - name: kc
              mountPath: /kc
      containers:
        - name: fa-app-migrate-ktm
          image: "REDACTED/km:git-522c04babb111b298d6a897cf12960eb35868082"
          command: [REDACTED]
          env:
            - name: APP_NAME
              value: fa-app-migrate-kt
            - name: KT_FILE
              value: kc/kt.yaml
          
          envFrom:
            - configMapRef:
                name: fa-app-hooks
            - secretRef:
                name: fa-app-hooks
          resources:
            limits:
              cpu: 1
              memory: 512Mi
            requests:
              cpu: 0.5
              memory: 512Mi
          securityContext:
            allowPrivilegeEscalation: false
          volumeMounts:
            - name: kc
              mountPath: /usr/src/app/kc
      nodeSelector: {}
      tolerations: []
      volumes:
        - name: kc
          emptyDir: {}
nazarewk added the bug label Jul 22, 2022
@nazarewk
Contributor Author

Generally, it seems like this should be impossible, because:

  1. Argo CD would not delete the ExternalSecret (and by proxy the Secret) unless PreSync was finished
  2. Argo CD would only consider PreSync completed if all Jobs were completed
    • the Job was just created, so it could not have been completed
  3. Kubernetes would only consider a Job completed if at least 1 finished Pod was pinned to it (see the status sketch below)
  4. Kubernetes would only consider a Pod finished when all containers had both started and finished
    • this could not have happened, because the Pod was just created
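
For reference, a minimal sketch (not from the thread) of what a completed Job's status looks like in the batch/v1 API; timestamps and counts are illustrative:

status:
  startTime: "2022-07-22T10:37:04Z"
  succeeded: 1                              # at least one Pod terminated successfully
  completionTime: "2022-07-22T10:40:00Z"
  conditions:
    - type: Complete                        # or Failed, once backoffLimit is exhausted
      status: "True"
      lastTransitionTime: "2022-07-22T10:40:00Z"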

@OmerKahani
Contributor

Hi, can you please add the definition (YAML) of the PreSync job?

@nazarewk
Contributor Author

Hi, can you please add the definition (YAML) of the PreSync job?

Done, you can find it collapsed at the bottom of the description.

@nazarewk
Contributor Author

nazarewk commented Jul 28, 2022

Looks like the issue happens less frequently after moving Argo CD onto half as many instances that are twice as large, and adjusting resource requests and limits:
screenshot-2022-07-28_09-39-54
I reconfigured the cluster around 13:00-15:00 on Monday the 25th; there were some trailing issues until around 18:00 and almost nothing after that.
Might it be related to some race condition when performance is sub-optimal?

@nazarewk
Contributor Author

nazarewk commented Aug 9, 2022

Could it be that events are received/processed out of order?

@nazarewk
Contributor Author

Another case of a hiccup. I suspect the Argo CD application controller might not be handling its cache properly under load? It seems like it thinks it still has the Job from 3 seconds earlier (from before the deletion due to the hook deletion policy).

screenshot-2022-08-12_12-42-15

This seems to be correlated with higher CPU usage on the Application Controller:
screenshot-2022-08-12_12-46-51

@nazarewk
Contributor Author

Continuing from the previous comment: there is a patch on the Application just before the deletion of the Secret. I don't have the content of the patch, but I'm pretty sure it could be the status update marking PreSync as completed.
screenshot-2022-08-12_12-56-43


@nazarewk
Contributor Author

nazarewk commented Aug 31, 2022

this might be caused by the hardcoded (low) number of processors; see #10458 for a fix
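
For reference (added context, not from the original thread): the application controller already exposes flags for its two work queues; whether they cover the hardcoded count that #10458 addresses is not confirmed here. A minimal sketch of raising them, with illustrative values:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
  namespace: argocd
spec:
  template:
    spec:
      containers:
        - name: argocd-application-controller
          command:
            - argocd-application-controller
            - --status-processors
            - "50"                          # default is 20
            - --operation-processors
            - "25"                          # default is 10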

@nazarewk
Contributor Author

nazarewk commented Sep 13, 2022

this is heavily related to:

  1. argo-cd/controller/sync.go, lines 465 to 485 at 9fac0f6 (there is a fix suggested):

     // delayBetweenSyncWaves is a gitops-engine SyncWaveHook which introduces an artificial delay
     // between each sync wave. We introduce an artificial delay in order give other controllers a
     // _chance_ to react to the spec change that we just applied. This is important because without
     // this, Argo CD will likely assess resource health too quickly (against the stale object), causing
     // hooks to fire prematurely. See: https://github.com/argoproj/argo-cd/issues/4669.
     // Note, this is not foolproof, since a proper fix would require the CRD record
     // status.observedGeneration coupled with a health.lua that verifies
     // status.observedGeneration == metadata.generation
     func delayBetweenSyncWaves(phase common.SyncPhase, wave int, finalWave bool) error {
         if !finalWave {
             delaySec := 2
             if delaySecStr := os.Getenv(EnvVarSyncWaveDelay); delaySecStr != "" {
                 if val, err := strconv.Atoi(delaySecStr); err == nil {
                     delaySec = val
                 }
             }
             duration := time.Duration(delaySec) * time.Second
             time.Sleep(duration)
         }
         return nil
     }

  2. Argo CD might progress through waves/PostSync prematurely #4669 - the original issue
  3. fix: inject artificial delay between sync waves to better support health assessments #4715 - a "good-enough" fix for the issue
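
Given the snippet above, the inter-wave delay is tunable; a minimal sketch assuming EnvVarSyncWaveDelay resolves to the documented ARGOCD_SYNC_WAVE_DELAY environment variable, with an illustrative value:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
  namespace: argocd
spec:
  template:
    spec:
      containers:
        - name: argocd-application-controller
          env:
            - name: ARGOCD_SYNC_WAVE_DELAY  # read by delayBetweenSyncWaves above
              value: "10"                   # seconds; the in-code default is 2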

@nazarewk
Contributor Author

note: it seems as if the issue stopped happening after we switched to rendering Helm templates ahead of time (committing manifests to git + using a directory source)
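
For anyone replicating that workaround, a minimal sketch of an Application consuming pre-rendered manifests via a directory source (repo URL, path, and names are placeholders):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: fa-app                              # placeholder
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/org/rendered-manifests.git  # placeholder; CI commits `helm template` output here
    targetRevision: main
    path: rendered/fa-app                   # placeholder
    directory:
      recurse: true
  destination:
    server: https://kubernetes.default.svc
    namespace: pr-prnum-ns
  syncPolicy:
    automated: {}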

nazarewk added a commit to nazarewk/argo-cd that referenced this issue Sep 13, 2022
…ing until recreated

fixes argoproj#10077

Signed-off-by: Krzysztof Nazarewski <3494992+nazarewk@users.noreply.github.com>
nazarewk added a commit to surgeventures/gitops-engine that referenced this issue Sep 13, 2022
fixes argoproj#446
closes argoproj/argo-cd#10579
closes argoproj/argo-cd#10077

Signed-off-by: Krzysztof Nazarewski <3494992+nazarewk@users.noreply.github.com>
nazarewk added a commit to surgeventures/gitops-engine that referenced this issue Sep 13, 2022
- fixes argoproj#446
- closes argoproj/argo-cd#10579
- original issue argoproj/argo-cd#10077

Signed-off-by: Krzysztof Nazarewski <3494992+nazarewk@users.noreply.github.com>
nazarewk added a commit to surgeventures/gitops-engine that referenced this issue Sep 14, 2022
- fixes argoproj#446
- closes argoproj/argo-cd#10579
- original issue argoproj/argo-cd#10077

Signed-off-by: Krzysztof Nazarewski <3494992+nazarewk@users.noreply.github.com>
thober35 pushed a commit to thober35/gitops-engine that referenced this issue Oct 25, 2023
- fixes argoproj#446
- closes argoproj/argo-cd#10579
- original issue argoproj/argo-cd#10077

Signed-off-by: Krzysztof Nazarewski <3494992+nazarewk@users.noreply.github.com>
thober35 pushed a commit to thober35/gitops-engine that referenced this issue Oct 26, 2023
- fixes argoproj#446
- closes argoproj/argo-cd#10579
- original issue argoproj/argo-cd#10077

Signed-off-by: Krzysztof Nazarewski <3494992+nazarewk@users.noreply.github.com>
thober35 added a commit to thober35/argo-cd that referenced this issue Oct 26, 2023
Signed-off-by: Thomas Bernhard <thomas_bernhard@live.at>
thober35 added a commit to thober35/argo-cd that referenced this issue Oct 27, 2023
Signed-off-by: Thomas Bernhard <thomas_bernhard@live.at>
thober35 pushed a commit to thober35/argo-cd that referenced this issue Nov 15, 2023
jharmison-redhat added a commit to rh-aiservices-bu/genai-rhoai-poc-template that referenced this issue Jan 17, 2025