
ArgoCD app health does not respect Hooked Job's result, custom LUA check doesn't work either #9861

Open
mdrakiburrahman opened this issue Jul 2, 2022 · 15 comments
Labels
bug/severity:criticial · bug · component:argo-cd · type:bug

Comments

@mdrakiburrahman

  • I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq
  • I've included steps to reproduce the bug.
  • I've pasted the output of argocd version:

Describe the bug

TL;DR: I have a single critical Job inside an Argo App. I'd like to rerun this Job on every sync, and I want the Argo App to reflect the Job's health. Is there a way I can make Argo respect a "hooked" Job's health?

My App-of-apps Use case:

  • An Argo App (say, App N) that consists of a Single Job. Nothing else.
  • App N+1 must only proceed if App N succeeds - so I use App N's health (Healthy, Degraded, Progressing).

For context, this is because App N+1 deploys CRs based on CRDs that App N introduces, among other things. So I don't want N+1 to run if N fails.

  • My single idempotent Job must run on every Sync - so naturally I use Argo hooks <-- here is the issue (a sketch follows below)
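
Roughly, the hooked version looks like this sketch (the Job name, image, and command are placeholders, not my real manifests); the two argocd.argoproj.io annotations are what make Argo treat the Job as a Sync hook and delete it once it succeeds:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: critical-install-job                              # placeholder name
      annotations:
        argocd.argoproj.io/hook: Sync                         # run the Job on every sync
        argocd.argoproj.io/hook-delete-policy: HookSucceeded  # delete it after success so it can run again next sync
    spec:
      backoffLimit: 0
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: installer
              image: registry.example.com/installer:latest    # placeholder image
              command: ["/bin/sh", "-c", "./install.sh"]      # placeholder command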

1 minute demo

Here are two versions of the app: one running my Job code without hooks (left), and one with hooks (right):

(animated GIF: ArgoCD-hook-fast)

Demo Manifests

As you can see, the "non-hook" version's health is respected by Argo, but then I can't run the Job again because Kubernetes won't let me recreate a Job with the same name.

And the "hooked" version's Job is rerunnable because Argo deletes it on HookSucceeded, but the Job's health is not respected by Argo.

So I'm between a rock and a hard place.

Workarounds

I tried adding a Lua health check that targets the batch/v1 Job GVK, with no luck: mdrakiburrahman/openshift-app-of-apps@4b5ada0

If I deploy a dummy Deployment at the end of my Sync hook Job, that does the trick for the first sync. But I can't go to production with that, because Argo will respect the Deployment's health before going green (so my App N+1 doesn't fire).

I was also thinking of running a PreSync Job with kubectl inside that deletes my non-hook or "vanilla" Job, as a workaround: Argo will re-apply the missing Job on the next sync, so I can rerun the Job per sync. This is gross.
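
A rough sketch of that PreSync deleter (names and image are placeholders; the deleter needs a service account with RBAC to delete Jobs in its namespace):

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: job-deleter                              # placeholder name
      annotations:
        argocd.argoproj.io/hook: PreSync             # runs before every sync
        argocd.argoproj.io/hook-delete-policy: HookSucceeded
    spec:
      template:
        spec:
          serviceAccountName: job-deleter            # placeholder; needs permission to delete Jobs
          restartPolicy: Never
          containers:
            - name: kubectl
              image: bitnami/kubectl:latest          # any image with kubectl works
              command:
                - /bin/sh
                - -c
                - kubectl delete job my-critical-job --ignore-not-found   # placeholder Job name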

Finally, if Kubernetes would just let me run the non-hook version of my Job multiple times, I would gladly do so, but it won't, because a Job with that name already exists. I also can't use generateName because Kustomize doesn't support it yet.

Version

root ➜ /workspaces/openshift-vsphere-install (main ✗) $ argocd version
argocd: v2.4.3+471685f
  BuildDate: 2022-06-27T21:23:52Z
  GitCommit: 471685feae063c1c2e36a5ff268c4da87c697b85
  GitTreeState: clean
  GoVersion: go1.18.3
  Compiler: gc
  Platform: linux/amd64
argocd-server: v2.3.1+b65c169
  BuildDate: 2022-03-10T22:51:09Z
  GitCommit: b65c1699fa2a2daa031483a3890e6911eac69068
  GitTreeState: clean
  GoVersion: go1.17.6
  Compiler: gc
  Platform: linux/amd64
  Kustomize Version: v4.4.1 2021-11-11T23:36:27Z
  Helm Version: v3.8.0+gd141386
  Kubectl Version: v0.23.1
  Jsonnet Version: v0.18.0
@phyzical
Contributor

phyzical commented Sep 9, 2022

👍 to this. I was hoping that when a Job resource becomes unhealthy it would mark the application as unhealthy, triggering a notification for us to action the issue.

I also tried the argocd-cm approach, but it doesn't seem to have any effect on the app's health. It does change the health check for the Job itself, though:

resource.customizations.health.batch_Job: |
      hs = {}
      if obj.status ~= nil then
        if obj.status.conditions ~= nil then
          for i, condition in ipairs(obj.status.conditions) do
            if condition.type == "Ready" and condition.status == "False" then
              hs.status = "Degraded"
              hs.message = condition.message
              return hs
            end
            if condition.type == "Ready" and condition.status == "True" then
              hs.status = "Healthy"
              hs.message = condition.message
              return hs
            end
          end
        end
      end

      hs.status = "Progressing"
      hs.message = "Progressing"
      return hs
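
As a side note, batch/v1 Jobs normally report Complete and Failed conditions rather than Ready, so a variant keyed on the conditions a Job actually surfaces might look like this sketch:

resource.customizations.health.batch_Job: |
      hs = {}
      if obj.status ~= nil and obj.status.conditions ~= nil then
        for i, condition in ipairs(obj.status.conditions) do
          -- batch/v1 Jobs surface "Failed" and "Complete" conditions
          if condition.type == "Failed" and condition.status == "True" then
            hs.status = "Degraded"
            hs.message = condition.message
            return hs
          end
          if condition.type == "Complete" and condition.status == "True" then
            hs.status = "Healthy"
            hs.message = condition.message
            return hs
          end
        end
      end
      hs.status = "Progressing"
      hs.message = "Job is still running"
      return hs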

@mdrakiburrahman did you happen to have any luck with this since you posted?

@mdrakiburrahman
Author

@phyzical I came up with a hack: a "Job deleter Job". Basically, this:

I was also thinking of running a PreSync Job with kubectl inside that deletes my non-hook or "vanilla" Job, as a workaround: Argo will re-apply the missing Job on the next sync, so I can rerun the Job per sync. This is gross.

Here's the code and the Kustomize overlay, feel free to use it: https://github.com/mdrakiburrahman/openshift-app-of-apps/blob/main/kube-arc-data-services-installer-job/kustomize/base/job-deleter-job.yaml#L12

@phyzical
Contributor

@mdrakiburrahman thanks for the reply :)

Sadly, I don't think this will serve our use case. It's less about fixing a Job in a broken state (though maybe that will be useful in the future for being less hands-on 😆) and more about alerting us that this state has occurred in the first place.

Looking at your demo GIFs, it occurred to me that if we just used plain Job resources it would probably work. But we use CronJobs, which create Jobs, and the health of a Job doesn't seem to propagate up to the CronJob/App.

thanks again though

@jowko

jowko commented Nov 21, 2022

We use ArgoCD version 2.4.11.

We deploy a Helm chart that contains a few Jobs with Sync hooks, and they can take a long time to run. We are told the app is healthy and send emails to users, when in fact our app is not ready yet. As a fix for now, we will also check whether a sync operation is in progress before taking any action, since the operation state is Running while sync hook Jobs are running.

But it would be nice to fix it.
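
A minimal sketch of that check, assuming the Application object is named my-app and lives in the argocd namespace (both placeholders):

    # Skip notifications while a sync operation (including its hook Jobs) is still running
    PHASE=$(kubectl get application my-app -n argocd -o jsonpath='{.status.operationState.phase}')
    if [ "$PHASE" = "Running" ]; then
      echo "Sync operation still in progress; not notifying yet"
      exit 0
    fi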

@LeszekBlazewski

I have just been bitten by this pretty hard. Could someone point me to the source of the decision why hooks don't influence the overall health of the application, even though they block further Sync operations? It seems pretty counterintuitive, especially in the UI: even if your app crashed mid-deployment, you still see a Healthy app and just a failed Sync (with no easy way to filter for this).

It looks like only the initial solution, with an ugly job-deleter Job and ditching hooks altogether, would cover all of the needs. Or could my requirement be covered by another tool, like Argo Rollouts or Argo Workflows?

@whoissteven-homelab

+1, I am also affected by this issue.
It would be nice for a Job failure to degrade the overall application's health.

@kshantaramanUFL

+1. I am also looking to have this!

Currently we have an ArgoCD app which deploys a Helm chart containing an Argo Rollout and a StatefulSet. If the Rollout is unhealthy, the app health is marked as Degraded. But we also have a Job that runs PostSync to check the health of the StatefulSet based on the metrics it emits, and if that Job fails we aren't seeing the app health get degraded!

Would love to hear from this forum on how to solve this!

@mkantzer

mkantzer commented Aug 2, 2024

Also looking for any kind of answer for this.

@alexmt added the bug/severity:criticial, component:argo-cd, and type:bug labels on Aug 2, 2024
@wadhwakabir

if res.Live != nil && (hookutil.IsHook(res.Live) || ignore.Ignore(res.Live)) {

As part of the ArgoCD health-check code design, the controller ignores the health of a resource if there is any hook bound to it.


This logic needs to be changed so that the health of hooked resources is also considered.

@wadhwakabir

@kshantaramanUFL this is expected behaviour per the code: if a Job has hooks, the controller ignores the Job's health when computing application health.

// now mark the job as a hook and retry. it should ignore the hook and consider the app healthy

@wadhwakabir

wadhwakabir commented Sep 17, 2024

@alexmt can we have some sort of flag in the application controller to consider the health of hooked resources for application health?
For example, a global config option to enable health monitoring of hooked resources, similar to how we already have skip annotations to ignore a resource.

@jaytulshian1301

I am also facing the same issue. Is there any workaround that I can use?

@wadhwakabir

@jaytulshian1301 in the code itself, if a resource is hooked, the health controller ignores its health.
You can remove the hooks to have that resource included in the health check.

@jaytulshian1301

My use case is different: I have to use a PostSync hook to run a Job after the app is synced. Is there something I can do while keeping the hook?

@wadhwakabir

wadhwakabir commented Oct 15, 2024

Instead of using hooks, you can create an init container that waits for the desired status before the main container runs; hooks make the Job skip the health check.
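
A rough sketch of that pattern: an unhooked Job whose init container waits on a rollout before the main container runs (the Deployment name, images, and commands are placeholders, and the pod needs RBAC to read the Deployment):

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: post-sync-check                        # placeholder; not an Argo hook, so its health counts toward the app
    spec:
      template:
        spec:
          restartPolicy: Never
          initContainers:
            - name: wait-for-rollout
              image: bitnami/kubectl:latest
              command:
                - /bin/sh
                - -c
                - kubectl rollout status deployment/my-app --timeout=300s   # placeholder target
          containers:
            - name: main
              image: registry.example.com/post-sync-task:latest             # placeholder image
              command: ["/bin/sh", "-c", "./run-checks.sh"]                 # placeholder command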
