
Both tasks marked as unhealthy if only one fails. #9254

Closed
erichulburd opened this issue Nov 2, 2020 · 4 comments · Fixed by #11945


@erichulburd

Nomad version

0.12.5

Operating system and Environment details

macOS Catalina 10.15.4, but I'm experiencing similar issues running on OpenStack.

Issue

When a task group has 2+ tasks and one fails, both tasks are marked unhealthy in the UI with "Task not running for min_healthy_time of 10s by deadline" (see reproduction steps below). Maybe this is intentional for some good reason, but the UX is confusing regardless.

[Screenshot: nomad-both-tasks-fail]

Reproduction steps

Please see source and instructions in erichulburd/consul-nomad-issues.

Job file (if appropriate)

erichulburd/consul-nomad-issues/job.json.

@tgross
Member

tgross commented Dec 16, 2020

Hi @erichulburd! Nomad is behaving correctly here, but the way it's being presented could probably be clearer.

Here's a more minimal reproduction for my colleagues on the UI team, so that they can see this for themselves:

  • Run Nomad and Consul (dev mode is fine for both)
  • Run the job below: nomad job run ./example.nomad
jobspec
job "example" {
  datacenters = ["dc1"]

  update {
    health_check      = "checks"
    min_healthy_time  = "10s"
    healthy_deadline  = "1m"
    progress_deadline = "2m"
  }

  group "webservers" {
    network {
      port "www1" {
        to = 8080
      }

      port "www2" {
        to = 8081
      }
    }

    task "example-1" {
      driver = "docker"

      config {
        image   = "busybox:1"
        command = "httpd"
        args    = ["-v", "-f", "-p", "8080", "-h", "${NOMAD_TASK_DIR}"]
        ports   = ["www1"]
      }

      template {
        data        = "<html>ok</html>"
        destination = "local/index.html"
      }

      service {
        check {
          port     = "www1"
          type     = "http"
          path     = "/"
          interval = "2s"
          timeout  = "1s"
        }
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }


    task "example-2" {
      driver = "docker"

      config {
        image   = "busybox:1"
        command = "httpd"
        args    = ["-v", "-f", "-p", "8081", "-h", "${NOMAD_TASK_DIR}"]
        ports   = ["www2"]
      }

      template {
        data        = "<html>ok</html>"
        destination = "local/index.html"
      }

      # this health check will never pass
      service {
        check {
          port     = "www2"
          type     = "http"
          path     = "/does-not-exist"
          interval = "2s"
          timeout  = "1s"
        }
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }

  }
}
  • Modify the job so that the allocation has to be updated (changing cpu = 501 in one of the resources blocks will do the trick; see the sketch after these steps).
  • Wait until the deployment fails (~2 min).
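For reference, the edit from the previous step is the only change needed (a sketch against the example.nomad above):

resources {
  cpu    = 501  # was 500; any change that forces a new allocation works just as well
  memory = 256
}

Once the progress deadline passes, the deployment is marked failed: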
$ nomad deployment status d44
ID          = d4408e56
Job ID      = example
Job Version = 1
Status      = failed
Description = Failed due to progress deadline

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
webservers  1        1       0        1          2020-12-16T21:08:56Z

At this point the allocation status will look like this. Note that both tasks have an Alloc Unhealthy event with the description "Task not running for min_healthy_time of 10s by deadline". If you take a look at the Nomad UI, you'll see the same thing shown there.

$ nomad alloc status 6cd
...

Task "example-1" is "running"
Task Resources
CPU        Memory          Disk     Addresses
0/500 MHz  36 KiB/256 MiB  300 MiB

Task Events:
Started At     = 2020-12-16T21:07:03Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type             Description
2020-12-16T16:08:02-05:00  Alloc Unhealthy  Task not running for min_healthy_time of 10s by deadline
2020-12-16T16:07:03-05:00  Started          Task started by client
2020-12-16T16:07:02-05:00  Task Setup       Building Task Directory
2020-12-16T16:06:56-05:00  Received         Task received by client

Task "example-2" is "running"
Task Resources
CPU        Memory          Disk     Addresses
0/500 MHz  40 KiB/256 MiB  300 MiB

Task Events:
Started At     = 2020-12-16T21:07:03Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type             Description
2020-12-16T16:08:02-05:00  Alloc Unhealthy  Task not running for min_healthy_time of 10s by deadline
2020-12-16T16:07:03-05:00  Started          Task started by client
2020-12-16T16:07:02-05:00  Task Setup       Building Task Directory
2020-12-16T16:06:56-05:00  Received         Task received by client

The Nomad client is doing the right thing here: the allocation was not healthy by the progress deadline, so the whole allocation is marked as unhealthy. The way this is presented in the CLI is a little weird, but because we can see the Type field it's a bit clearer. In the web UI we don't have the Type field, and maybe that's why it looks misleading?
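For anyone who wants to see the event Type outside the UI, something like this should surface the same data from the allocation JSON (a sketch; it assumes jq is installed and uses the alloc ID from above):

$ nomad alloc status -json 6cd | \
    jq '.TaskStates | map_values([.Events[] | {Type, DisplayMessage}])'

That prints the same events the CLI table shows, with the Type kept alongside each message.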

@DingoEatingFuzz
Contributor

The way this is presented in the CLI is a little weird but because we see the Type I feel like that makes it a bit more clear.

It still seems weird to me that both tasks get the event? I suppose the event truly belongs to the alloc but there are no alloc events so this is the workaround?

But in the web UI we don't have the Type field and maybe that's why it looks misleading?

The type is shown if you drill into the task detail page, but we can probably make this clearer on the alloc detail page too.

@tgross tgross added this to Needs Roadmapping in Nomad - Community Issues Triage Feb 12, 2021
@tgross tgross removed this from Needs Roadmapping in Nomad - Community Issues Triage Mar 4, 2021
@crestonbunch

crestonbunch commented Dec 22, 2021

I'm seeing this behavior when I have a task with:

lifecycle {
  hook    = "poststart"
  sidecar = false
}

Even when the task exits successfully with status code 0 and everything else is fine, the alloc is marked as unhealthy.

This seems like a bug to me, or at least something that needs a workaround. If I have a task that is meant to do some initialization and then exit immediately, that shouldn't block the other tasks from being considered healthy.
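For illustration, roughly the shape of such a task (a sketch only; the task name, image, and command are hypothetical, not taken from the actual job):

task "init" {
  driver = "docker"

  lifecycle {
    hook    = "poststart"
    sidecar = false
  }

  config {
    image   = "busybox:1"
    command = "sh"
    args    = ["-c", "echo 'one-time setup' && exit 0"]  # finishes successfully right away
  }
}

With a setup like this, the task exits 0 but the allocation is still marked unhealthy, as described above.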

Edit: it seems like #10058 is the issue I'm describing.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 12, 2022