
Both tasks marked as unhealthy if only one fails. #9254

Closed
erichulburd opened this issue Nov 2, 2020 · 4 comments · Fixed by #11945


@erichulburd

Nomad version

0.12.5

Operating system and Environment details

macOS Catalina 10.15.4, but I'm experiencing similar issues running on OpenStack.

Issue

When a task group has 2+ tasks and one fails, both tasks are marked unhealthy in the UI with "Task not running for min_healthy_time of 10s by deadline" (see reproduction steps below). Maybe this is intentional for some good reason, but the UX is confusing regardless.

[Screenshot: nomad-both-tasks-fail]

Reproduction steps

Please see source and instructions in erichulburd/consul-nomad-issues.

Job file (if appropriate)

erichulburd/consul-nomad-issues/job.json.

@tgross
Member

tgross commented Dec 16, 2020

Hi @erichulburd! Nomad is behaving correctly here, but the way it's being presented could probably be clearer.

Here's a more minimal reproduction for my colleagues on the UI team, so that they can see this for themselves:

  • Run Nomad and Consul (dev mode is fine for both)
  • Run the job below: nomad job run ./example.nomad
jobspec
job "example" {
  datacenters = ["dc1"]

  update {
    health_check      = "checks"
    min_healthy_time  = "10s"
    healthy_deadline  = "1m"
    progress_deadline = "2m"
  }

  group "webservers" {
    network {
      port "www1" {
        to = 8080
      }

      port "www2" {
        to = 8081
      }
    }

    task "example-1" {
      driver = "docker"

      config {
        image   = "busybox:1"
        command = "httpd"
        args    = ["-v", "-f", "-p", "8080", "-h", "${NOMAD_TASK_DIR}"]
        ports   = ["www1"]
      }

      template {
        data        = "<html>ok</html>"
        destination = "local/index.html"
      }

      service {
        check {
          port     = "www1"
          type     = "http"
          path     = "/"
          interval = "2s"
          timeout  = "1s"
        }
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }


    task "example-2" {
      driver = "docker"

      config {
        image   = "busybox:1"
        command = "httpd"
        args    = ["-v", "-f", "-p", "8081", "-h", "${NOMAD_TASK_DIR}"]
        ports   = ["www2"]
      }

      template {
        data        = "<html>ok</html>"
        destination = "local/index.html"
      }

      # this health check will never pass
      service {
        check {
          port     = "www2"
          type     = "http"
          path     = "/does-not-exist"
          interval = "2s"
          timeout  = "1s"
        }
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }

  }
}
  • Modify the job so that the allocation has to be updated (changing cpu = 501 in one of the resources blocks will do the trick; see the sketch after these steps).
  • Wait until the deployment fails (~2 min).
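For reference, the edit from the previous step is the only change needed (a sketch against the example.nomad above):

resources {
  cpu    = 501  # was 500; any change that forces a new allocation works just as well
  memory = 256
}

Once the progress deadline passes, the deployment is marked failed: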
$ nomad deployment status d44
ID          = d4408e56
Job ID      = example
Job Version = 1
Status      = failed
Description = Failed due to progress deadline

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
webservers  1        1       0        1          2020-12-16T21:08:56Z

At this point the allocation status will look like this. Note that both tasks have an Alloc Unhealthy event with the description "Task not running for min_healthy_time of 10s by deadline". If you take a look at the Nomad UI, you'll see the same thing shown there.

$ nomad alloc status 6cd
...

Task "example-1" is "running"
Task Resources
CPU        Memory          Disk     Addresses
0/500 MHz  36 KiB/256 MiB  300 MiB

Task Events:
Started At     = 2020-12-16T21:07:03Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type             Description
2020-12-16T16:08:02-05:00  Alloc Unhealthy  Task not running for min_healthy_time of 10s by deadline
2020-12-16T16:07:03-05:00  Started          Task started by client
2020-12-16T16:07:02-05:00  Task Setup       Building Task Directory
2020-12-16T16:06:56-05:00  Received         Task received by client

Task "example-2" is "running"
Task Resources
CPU        Memory          Disk     Addresses
0/500 MHz  40 KiB/256 MiB  300 MiB

Task Events:
Started At     = 2020-12-16T21:07:03Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type             Description
2020-12-16T16:08:02-05:00  Alloc Unhealthy  Task not running for min_healthy_time of 10s by deadline
2020-12-16T16:07:03-05:00  Started          Task started by client
2020-12-16T16:07:02-05:00  Task Setup       Building Task Directory
2020-12-16T16:06:56-05:00  Received         Task received by client

The Nomad client is doing the right thing here: the allocation was not healthy by the progress deadline, so the whole allocation is marked as unhealthy. The way this is presented in the CLI is a little weird, but because we can see the Type field it's a bit clearer. In the web UI we don't have the Type field, and maybe that's why it looks misleading?
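For anyone who wants to see the event Type outside the UI, something like this should surface the same data from the allocation JSON (a sketch; it assumes jq is installed and uses the alloc ID from above):

$ nomad alloc status -json 6cd | \
    jq '.TaskStates | map_values([.Events[] | {Type, DisplayMessage}])'

That prints the same events the CLI table shows, with the Type kept alongside each message.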

@DingoEatingFuzz
Contributor

The way this is presented in the CLI is a little weird but because we see the Type I feel like that makes it a bit more clear.

It still seems weird to me that both tasks get the event? I suppose the event truly belongs to the alloc but there are no alloc events so this is the workaround?

But in the web UI we don't have the Type field and maybe that's why it looks misleading?

The type is shown if you drill into the task detail page, but we can probably make this clearer on the alloc detail page too.

@tgross tgross added this to Needs Roadmapping in Nomad - Community Issues Triage Feb 12, 2021
@tgross tgross removed this from Needs Roadmapping in Nomad - Community Issues Triage Mar 4, 2021
@crestonbunch

crestonbunch commented Dec 22, 2021

I'm seeing this behavior when I have a task with:

lifecycle {
  hook    = "poststart"
  sidecar = false
}

Even when the task exits successfully with status code 0 and everything else is fine, the alloc is marked as unhealthy.

This seems like a bug to me, or at least something that needs a workaround. If I have a task that is meant to do some initialization and then exit immediately, that shouldn't block the other tasks from being considered healthy.
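For illustration, roughly the shape of such a task (a sketch only; the task name, image, and command are hypothetical, not taken from the actual job):

task "init" {
  driver = "docker"

  lifecycle {
    hook    = "poststart"
    sidecar = false
  }

  config {
    image   = "busybox:1"
    command = "sh"
    args    = ["-c", "echo 'one-time setup' && exit 0"]  # finishes successfully right away
  }
}

With a setup like this, the task exits 0 but the allocation is still marked unhealthy, as described above.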

Edit: it seems like #10058 is the issue I'm describing.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 12, 2022