Poststop lifecycle hook prevents deployment status going healthy #9361

Closed
optiz0r opened this issue Nov 14, 2020 · 6 comments · Fixed by #9548
Labels
stage/accepted (Confirmed, and intend to work on. No timeline commitment though.) · theme/task lifecycle · type/bug

Comments

optiz0r (Contributor) commented Nov 14, 2020

Nomad version

Nomad v1.0.0-beta3 (fcb32ef)

Operating system and Environment details

Nomad with Docker 19.03.8

Issue

A Nomad task group with a poststop lifecycle hook and a Consul service registration never reaches a healthy deployment status. After a long timeout the allocation transitions to unhealthy in Nomad. The deployment itself spends a long time in the running state and eventually transitions to failed. All the while, the Consul health check shows healthy.

Removing the poststop lifecycle hook makes the allocation transition immediately to a healthy status.
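
For reference, the fragment of the job file below that triggers the behaviour is the poststop block on the unregister task:

task "unregister" {
    lifecycle {
        hook = "poststop"
    }
    # ... driver, config, and resources as in the full job below
}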

Reproduction steps

Job file

job "gitlab-runner" {
    region = "global"
    datacenters = ["dc1"]

    type = "service"

    update {
        stagger = "30s"
        max_parallel = 1
    }

    group "ci" {
        count = 2

        constraint {
            operator = "distinct_hosts"
            value = "true"
        }

        network {
            port "metrics" {
                static = 9252
            }
        }

        service {
            name = "gitlab-runner"
            tags = ["metrics"]
            port = "metrics"
            check {
                name = "alive"
                type = "http"
                path = "/metrics"
                interval = "10s"
                timeout = "2s"
            }
        }

        task "cleanup-runners" {
            driver = "docker"

            lifecycle {
                hook = "prestart"
            }

            config {
                image = "gitlab/gitlab-runner:latest"
                args = [
                    "verify", "--delete",
                ]
                volumes = [
                    "${NOMAD_ALLOC_DIR}/gitlab-runner:/etc/gitlab-runner",
                ]
            }

            resources {
                cpu = 100 # MHz
                memory = 128 # MB
            }
        }

        task "register" {
            driver = "docker"

            lifecycle {
                hook = "prestart"
            }

            config {
                image = "gitlab/gitlab-runner:latest"
                args = [
                    "register", "-n"
                ]
                volumes = [
                    "/run/docker.sock:/var/run/docker.sock",
                    "${NOMAD_ALLOC_DIR}/gitlab-runner:/etc/gitlab-runner",
                ]
            }

            vault {
                policies = ["gitlab-runner"]
                change_mode = "signal"
                change_signal = "SIGHUP"
            }

            env {
                CI_SERVER_URL = "https://gitlab.example.com/"
                RUNNER_NAME = "${attr.unique.hostname}"
                REGISTER_RUN_UNTAGGED = "1"
                RUNNER_EXECUTOR = "docker"
                DOCKER_IMAGE = "alpine:latest"
                DOCKER_VOLUMES = "/var/run/docker.sock:/var/run/docker.sock"
                REGISTER_LOCKED = "0"
                CONCURRENT = "4"
            }

            template {
                destination = "secrets/registration_token.env"
                env = true
                data = <<EOT
                REGISTRATION_TOKEN="{{with secret "secret/apps/gitlab-runner/registration_token"}}{{.Data.data.token}}{{end}}"
                EOT
            }

            resources {
                cpu = 100 # MHz
                memory = 128 # MB
            }
        }

        task "runner" {
            driver = "docker"

            config {
                image = "gitlab/gitlab-runner:latest"
                args = [
                    "run"
                ]
                volumes = [
                    "/run/docker.sock:/var/run/docker.sock",
                    "${NOMAD_ALLOC_DIR}/gitlab-runner:/etc/gitlab-runner",
                ]
                ports = ["metrics"]
            }
            
            env {
                LISTEN_ADDRESS = "0.0.0.0:9252"
            }

            resources {
                cpu = 500 # MHz
                memory = 512 # MB
            }

        }

        task "unregister" {
            driver = "docker"

            lifecycle {
                # this poststop hook is what prevents the deployment
                # from being marked healthy (the subject of this issue)
                hook = "poststop"
            }

            config {
                image = "gitlab/gitlab-runner:latest"
                args = [
                    "unregister", "--all-runners"
                ]
                volumes = [
                    "/run/docker.sock:/var/run/docker.sock",
                    "${NOMAD_ALLOC_DIR}/gitlab-runner:/etc/gitlab-runner",
                ]
            }

            env {
                CI_SERVER_URL = "https://gitlab.example.com/"
            }

            resources {
                cpu = 100 # MHz
                memory = 128 # MB
            }
        }
    }
}

Deployment status

# Some time after allocs have started
$ ~/bin/nomad deployment status -verbose 8367f485
ID          = 8367f485-19ae-b600-c76c-a42d017eb71a
Job ID      = gitlab-runner
Job Version = 0
Status      = running
Description = Deployment is running

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
ci          2        2       0        0          2020-11-14T00:26:15Z
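
The "long timeout" here is the deployment's progress deadline. This job's update stanza leaves the health deadlines at their defaults (healthy_deadline = "5m" and progress_deadline = "10m", assuming Nomad's documented defaults), so the allocation only flips to unhealthy after five minutes and the deployment only fails when the progress deadline passes. A minimal sketch of shortening both to reproduce the failure faster:

update {
    stagger           = "30s"
    max_parallel      = 1
    healthy_deadline  = "2m" # give up on the alloc becoming healthy sooner while testing
    progress_deadline = "4m" # fail the deployment sooner while testing
}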

rockaut commented Nov 30, 2020

Same here, also on beta3 with different jobs.
Removing the poststop task makes everything healthy again.

drewbailey (Contributor) commented

Thank you for the report. I've reproduced this and we are looking into it.


rockaut commented Dec 7, 2020

@drewbailey nice ... if there's something to test just ping me. I'm always willing to tinker around. Well... at least if it fits on arm64 :D

tgross added the stage/accepted (Confirmed, and intend to work on. No timeline commitment though.) label Dec 17, 2020
jazzyfresh added a commit that referenced this issue Jan 12, 2021
* investigating where to ignore poststop task in alloc health tracker

* ignore poststop when setting latest start time for allocation

* clean up logic

* lifecycle: isolate mocks for poststop deployment test

* lifecycle: update comments in tracker

Co-authored-by: Jasmine Dahilig <jasmine@dahilig.com>
jazzyfresh added a commit that referenced this issue Jan 12, 2021
drewbailey pushed a commit that referenced this issue Jan 12, 2021
backspace pushed a commit that referenced this issue Jan 22, 2021

jfabales commented Jan 24, 2021

Hi,
Just noticed this happening to my jobs with poststart tasks and sidecar = false after upgrading to Nomad v1.0.2. Allocations become unhealthy after the poststart tasks have finished running, while the Consul service checks all show healthy. Tasks all transition to healthy again after removing the poststart task.
So just wondering: does the same fix need to be applied for poststart tasks?

Thanks!
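
For reference, the shape of the configuration being described is a poststart task with sidecar = false; a minimal sketch (the task name, image, and command are hypothetical):

task "warmup" {
    driver = "docker"

    lifecycle {
        hook    = "poststart"
        sidecar = false # one-shot: runs once after the main task starts, then exits
    }

    config {
        image   = "alpine:latest"
        command = "echo"
        args    = ["warmed up"]
    }

    resources {
        cpu    = 100 # MHz
        memory = 64  # MB
    }
}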


havenith commented Nov 7, 2021

I'm seeing this too on Nomad 1.1.6. We need poststart non-sidecar tasks to complete the deployment steps for us, so this is basically a showstopper.

github-actions (bot) commented

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Oct 14, 2022