
ArgoCD app health does not respect Hooked Job's result, custom LUA check doesn't work either #9861

Open
mdrakiburrahman opened this issue Jul 2, 2022 · 15 comments
Labels
bug/severity:criticial · bug · component:argo-cd · type:bug

Comments

@mdrakiburrahman

  • I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq
  • I've included steps to reproduce the bug.
  • I've pasted the output of argocd version:

Describe the bug

TL;DR: I have a single critical Job inside an Argo App. I'd like to rerun this Job on every sync, and I want the Argo App to reflect the Job's health. Is there a way I can make Argo respect a "hooked" Job's health?

My App-of-apps Use case:

  • An Argo App (say, App N) that consists of a Single Job. Nothing else.
  • App N+1 must only proceed if App N succeeds - so I use App N's health (Healthy, Degraded, Progressing).

For context, this is because App N+1 deploys CRs based on CRDs that App N introduces, among other things. So I don't want N+1 to run if N fails.

  • My single idempotent Job must run on every Sync - so naturally I use Argo hooks <-- here is the issue (a sketch follows below)
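
Roughly, the hooked version looks like this sketch (the Job name, image, and command are placeholders, not my real manifests); the two argocd.argoproj.io annotations are what make Argo treat the Job as a Sync hook and delete it once it succeeds:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: critical-install-job                              # placeholder name
      annotations:
        argocd.argoproj.io/hook: Sync                         # run the Job on every sync
        argocd.argoproj.io/hook-delete-policy: HookSucceeded  # delete it after success so it can run again next sync
    spec:
      backoffLimit: 0
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: installer
              image: registry.example.com/installer:latest    # placeholder image
              command: ["/bin/sh", "-c", "./install.sh"]      # placeholder command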

1 minute demo

Here are two versions of the app: one running my Job code without hooks (left), and one with hooks (right):

(animated GIF: ArgoCD-hook-fast)

Demo Manifests

As you can see, the "non-hook" version's health is respected by Argo, but then I can't run the Job again because Kubernetes won't let me recreate a Job with the same name.

And the "hooked" version's Job is rerunnable because Argo deletes it on HookSucceeded, but the Job's health is not respected by Argo.

So I'm between a rock and a hard place.

Workarounds

I tried adding a Lua health check that targets the batch/v1 Job GVK, with no luck: mdrakiburrahman/openshift-app-of-apps@4b5ada0

If I deploy a dummy Deployment at the end of my Sync hook Job, that does the trick for the first sync. But I can't go to production with that, because Argo will respect the Deployment's health before going green (so my App N+1 doesn't fire).

I was also thinking of running a PreSync Job with kubectl inside that deletes my non-hook or "vanilla" Job, as a workaround: Argo will re-apply the missing Job on the next sync, so I can rerun the Job per sync. This is gross.
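
A rough sketch of that PreSync deleter (names and image are placeholders; the deleter needs a service account with RBAC to delete Jobs in its namespace):

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: job-deleter                              # placeholder name
      annotations:
        argocd.argoproj.io/hook: PreSync             # runs before every sync
        argocd.argoproj.io/hook-delete-policy: HookSucceeded
    spec:
      template:
        spec:
          serviceAccountName: job-deleter            # placeholder; needs permission to delete Jobs
          restartPolicy: Never
          containers:
            - name: kubectl
              image: bitnami/kubectl:latest          # any image with kubectl works
              command:
                - /bin/sh
                - -c
                - kubectl delete job my-critical-job --ignore-not-found   # placeholder Job name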

Finally, if Kubernetes would just let me run the non-hook version of my Job multiple times, I would gladly do so, but it won't, because a Job with that name already exists. I also can't use generateName because Kustomize doesn't support it yet.

Version

root ➜ /workspaces/openshift-vsphere-install (main ✗) $ argocd version
argocd: v2.4.3+471685f
  BuildDate: 2022-06-27T21:23:52Z
  GitCommit: 471685feae063c1c2e36a5ff268c4da87c697b85
  GitTreeState: clean
  GoVersion: go1.18.3
  Compiler: gc
  Platform: linux/amd64
argocd-server: v2.3.1+b65c169
  BuildDate: 2022-03-10T22:51:09Z
  GitCommit: b65c1699fa2a2daa031483a3890e6911eac69068
  GitTreeState: clean
  GoVersion: go1.17.6
  Compiler: gc
  Platform: linux/amd64
  Kustomize Version: v4.4.1 2021-11-11T23:36:27Z
  Helm Version: v3.8.0+gd141386
  Kubectl Version: v0.23.1
  Jsonnet Version: v0.18.0
@phyzical
Contributor

phyzical commented Sep 9, 2022

👍 to this. I was hoping that when a Job resource becomes unhealthy it would mark the application as unhealthy, triggering a notification for us to action the issue.

I also tried the argocd-cm approach, but it doesn't seem to have any effect on the app's health. It does change the health check for the Job itself, though:

resource.customizations.health.batch_Job: |
      hs = {}
      if obj.status ~= nil then
        if obj.status.conditions ~= nil then
          for i, condition in ipairs(obj.status.conditions) do
            if condition.type == "Ready" and condition.status == "False" then
              hs.status = "Degraded"
              hs.message = condition.message
              return hs
            end
            if condition.type == "Ready" and condition.status == "True" then
              hs.status = "Healthy"
              hs.message = condition.message
              return hs
            end
          end
        end
      end

      hs.status = "Progressing"
      hs.message = "Progressing"
      return hs
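
As a side note, batch/v1 Jobs normally report Complete and Failed conditions rather than Ready, so a variant keyed on the conditions a Job actually surfaces might look like this sketch:

resource.customizations.health.batch_Job: |
      hs = {}
      if obj.status ~= nil and obj.status.conditions ~= nil then
        for i, condition in ipairs(obj.status.conditions) do
          -- batch/v1 Jobs surface "Failed" and "Complete" conditions
          if condition.type == "Failed" and condition.status == "True" then
            hs.status = "Degraded"
            hs.message = condition.message
            return hs
          end
          if condition.type == "Complete" and condition.status == "True" then
            hs.status = "Healthy"
            hs.message = condition.message
            return hs
          end
        end
      end
      hs.status = "Progressing"
      hs.message = "Job is still running"
      return hs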

@mdrakiburrahman did you happen to have any luck with this since you posted?

@mdrakiburrahman
Author

@phyzical I came up with a hack: a "Job deleter Job". Basically, this:

I was also thinking of running a PreSync Job with kubectl inside that deletes my non-hook or "vanilla" Job, as a workaround: Argo will re-apply the missing Job on the next sync, so I can rerun the Job per sync. This is gross.

Here's the code and the Kustomize overlay, feel free to use it: https://github.com/mdrakiburrahman/openshift-app-of-apps/blob/main/kube-arc-data-services-installer-job/kustomize/base/job-deleter-job.yaml#L12

@phyzical
Contributor

@mdrakiburrahman thanks for the reply :)

Sadly, I don't think this will serve our use case. It's less about fixing a Job in a broken state (though maybe that will be useful in the future for being less hands-on 😆) and more about alerting us that this state has occurred in the first place.

Looking at your demo GIFs, it occurred to me that if we just used plain Job resources it would probably work. But we use CronJobs, which create Jobs, and the health of a Job doesn't seem to propagate up to the CronJob/App.

thanks again though

@jowko

jowko commented Nov 21, 2022

We use ArgoCD version 2.4.11.

We deploy a Helm chart that contains a few Jobs with Sync hooks, and they can take a long time to run. We are told the app is healthy and send emails to users, when in fact our app is not ready yet. As a fix for now, we will also check whether a sync operation is in progress before taking any action, since the operation state is Running while sync hook Jobs are running.

But it would be nice to fix it.
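
A minimal sketch of that check, assuming the Application object is named my-app and lives in the argocd namespace (both placeholders):

    # Skip notifications while a sync operation (including its hook Jobs) is still running
    PHASE=$(kubectl get application my-app -n argocd -o jsonpath='{.status.operationState.phase}')
    if [ "$PHASE" = "Running" ]; then
      echo "Sync operation still in progress; not notifying yet"
      exit 0
    fi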

@LeszekBlazewski

I have just been bitten by this pretty hard. Could someone point me to the source of the decision why hooks don't influence the overall health of the application, even though they block further Sync operations? It seems pretty counterintuitive, especially in the UI: even if your app crashed mid-deployment, you still see a Healthy app and just a failed Sync (with no easy way to filter for this).

It looks like only the initial solution, with an ugly job-deleter Job and ditching hooks altogether, would cover all of the needs. Or could my requirement be covered by another tool, like Argo Rollouts or Argo Workflows?

@whoissteven-homelab

+1, I am also affected by this issue.
It would be nice for a Job failure to degrade the overall application's health.

@kshantaramanUFL

+1. I am also looking to have this!

Currently we have an ArgoCD app which deploys a Helm chart containing an Argo Rollout and a StatefulSet. If the Rollout is unhealthy, the app health is marked as Degraded. But we also have a Job that runs PostSync to check the health of the StatefulSet based on the metrics it emits, and if that Job fails we aren't seeing the app health get degraded!

Would love to hear from this forum on how to solve this!

@mkantzer

mkantzer commented Aug 2, 2024

Also looking for any kind of answer for this.

@alexmt added the bug/severity:criticial, component:argo-cd, and type:bug labels on Aug 2, 2024
@wadhwakabir

if res.Live != nil && (hookutil.IsHook(res.Live) || ignore.Ignore(res.Live)) {

As part of the ArgoCD health-check code design, the controller ignores the health of a resource if there is any hook bound to it.


This logic needs to be changed so that the health of hooked resources is also considered.

@wadhwakabir

@kshantaramanUFL this is expected behaviour per the code: if a Job has hooks, the controller ignores the Job's health when computing application health.

// now mark the job as a hook and retry. it should ignore the hook and consider the app healthy

@wadhwakabir

wadhwakabir commented Sep 17, 2024

@alexmt can we have some sort of flag in the application controller to consider the health of hooked resources for application health?
For example, a global config option to enable health monitoring of hooked resources, similar to how we already have skip annotations to ignore a resource.

@jaytulshian1301

I am also facing the same issue. Is there any workaround that I can use?

@wadhwakabir

@jaytulshian1301 in the code itself, if a resource is hooked, the health controller ignores its health.
You can remove the hooks to have that resource included in the health check.

@jaytulshian1301

My use case is different: I have to use a PostSync hook to run a Job after the app is synced. Is there something I can do while keeping the hook?

@wadhwakabir

wadhwakabir commented Oct 15, 2024

Instead of using hooks, you can create an init container that waits for the desired status before the main container runs; hooks make the Job skip the health check.
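
A rough sketch of that pattern: an unhooked Job whose init container waits on a rollout before the main container runs (the Deployment name, images, and commands are placeholders, and the pod needs RBAC to read the Deployment):

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: post-sync-check                        # placeholder; not an Argo hook, so its health counts toward the app
    spec:
      template:
        spec:
          restartPolicy: Never
          initContainers:
            - name: wait-for-rollout
              image: bitnami/kubectl:latest
              command:
                - /bin/sh
                - -c
                - kubectl rollout status deployment/my-app --timeout=300s   # placeholder target
          containers:
            - name: main
              image: registry.example.com/post-sync-task:latest             # placeholder image
              command: ["/bin/sh", "-c", "./run-checks.sh"]                 # placeholder command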
