Nomad incorrectly marking unhealthy allocs as healthy during rolling upgrade #7320

Closed
dpn opened this issue Mar 11, 2020 · 10 comments · Fixed by #7383


dpn commented Mar 11, 2020

Nomad version

Found on

$ nomad version
Nomad v0.9.6 (1f8eddf2211d064b150f141c86e30d9fceabec89)

Also repros on these versions in our test clusters:

$ nomad version
Nomad v0.9.7 (0e0eb07c53f99f54bcdb2e69aa8a9690a0597e7a)
$ nomad version
Nomad v0.10.4 (f750636ca68e17dcd2445c1ab9c5a34f9ac69345)

Operating system and Environment details

Originally found in AWS:

  • 3x Nomad Servers (v0.9.6)
  • On the order of 100s of Nomad clients
  • Consul v1.6.3

Reproduced on colocated hardware:

  • 3x Nomad servers (v0.9.7 and v0.10.4)
  • On the order of 100s of Nomad clients
  • Consul v1.6.3

Issue

The issue was discovered when one of our engineers pushed out a deployment where the replacement allocs were failing their healthchecks due to improperly configured Security Groups in AWS, yet Nomad continued to replace the healthy allocs with unhealthy ones until the entire service was down.

In the repro steps below, Nomad appears to consider the replacement allocations healthy when they are not, and this seems to be triggered when the replacement alloc is restarted by the service's CheckRestart stanza. Note that this does not reproduce with a single-task job; multiple tasks are required for this behavior.

I'm not seeing any issues with the config that would lead to this behavior, but it's entirely possible I've overlooked something.

Reproduction steps

  1. Submit stable job

    $ curl -v -X PUT -H "X-Nomad-Token: $NOMAD_TOKEN" -d @test-nomad-rolling-upgrade-ok.json https://$HOSTNAME:4646/v1/job/test-nomad-rolling-deployments
    
  2. Wait for initial deployment to succeed:

    ID            = test-nomad-rolling-deployments
    Name          = test-nomad-rolling-deployments
    Submit Date   = 2020-03-11T17:07:36Z
    Type          = service
    Priority      = 50
    Datacenters   = a-dc
    Status        = running
    Periodic      = false
    Parameterized = false
    
    Summary
    Task Group                      Queued  Starting  Running  Failed  Complete  Lost
    test-nomad-rolling-deployments  0       0         3        0       0         0
    
    Latest Deployment
    ID          = b0343364
    Status      = successful
    Description = Deployment completed successfully
    
    Deployed
    Task Group                      Auto Revert  Desired  Placed  Healthy  Unhealthy  Progress Deadline
    test-nomad-rolling-deployments  true         3        3       3        0          2020-03-11T11:24:02-06:00
    
    Allocations
    ID        Node ID   Task Group                      Version  Desired  Status   Created  Modified
    1081ddfd  cced5c02  test-nomad-rolling-deployments  0        run      running  38s ago  12s ago
    a2759873  0faae4b3  test-nomad-rolling-deployments  0        run      running  38s ago  17s ago
    ba6b37bc  ed21263f  test-nomad-rolling-deployments  0        run      running  38s ago  18s ago
    
  3. Modify the job file to tweak the CPU allocation (this forces a new deployment, simulating a Docker image version bump) and break the healthcheck on one of the tasks by changing its healthcheck path:

    $ diff test-nomad-rolling-upgrade-ok.json test-nomad-rolling-upgrade-not-ok.json
    18c18
    <                             "CPU": 48,
    ---
    >                             "CPU": 24,
    42c42
    <                                         "Path": "/healthcheck-ok",
    ---
    >                                         "Path": "/healthcheck-not-ok",
    
  4. Submit the updated job

    $ curl -v -X PUT -H "X-Nomad-Token: $NOMAD_TOKEN" -d @test-nomad-rolling-upgrade-not-ok.json  https://$HOSTNAME:4646/v1/job/test-nomad-rolling-deployments
    
  5. Deployment begins by creating a new alloc

    ID            = test-nomad-rolling-deployments
    Name          = test-nomad-rolling-deployments
    Submit Date   = 2020-03-11T17:17:33Z
    Type          = service
    Priority      = 50
    Datacenters   = a-dc
    Status        = running
    Periodic      = false
    Parameterized = false
    
    Summary
    Task Group                      Queued  Starting  Running  Failed  Complete  Lost
    test-nomad-rolling-deployments  0       0         3        0       1         0
    
    Latest Deployment
    ID          = d76e2a0e
    Status      = running
    Description = Deployment is running
    
    Deployed
    Task Group                      Auto Revert  Desired  Placed  Healthy  Unhealthy  Progress Deadline
    test-nomad-rolling-deployments  true         3        1       0        0          2020-03-11T17:33:34Z
    
    Allocations
    ID        Node ID   Task Group                      Version  Desired  Status    Created     Modified
    cf26ce6c  c38e9054  test-nomad-rolling-deployments  1        run      running   24s ago     13s ago
    1081ddfd  cced5c02  test-nomad-rolling-deployments  0        stop     complete  10m22s ago  18s ago
    a2759873  0faae4b3  test-nomad-rolling-deployments  0        run      running   10m22s ago  10m1s ago
    ba6b37bc  ed21263f  test-nomad-rolling-deployments  0        run      running   10m22s ago  10m2s ago
    
  6. CheckRestart stanza takes effect, restarting the new alloc:

    ID                  = cf26ce6c
    Eval ID             = 187a10ba
    Name                = test-nomad-rolling-deployments.test-nomad-rolling-deployments[0]
    Node ID             = c38e9054
    Node Name           = a.node.tld
    Job ID              = test-nomad-rolling-deployments
    Job Version         = 824638703088
    Client Status       = running
    Client Description  = Tasks are running
    Desired Status      = run
    Desired Description = <none>
    Created             = 1m49s ago
    Modified            = 23s ago
    Deployment ID       = d76e2a0e
    Deployment Health   = healthy
    
    Task "main" is "running"
    Task Resources
    CPU       Memory          Disk     Addresses
    0/24 MHz  1.2 MiB/32 MiB  300 MiB  main_port: 10.4.26.15:23739
    
    Task Events:
    Started At     = 2020-03-11T17:18:59Z
    Finished At    = N/A
    Total Restarts = 1
    Last Restart   = 2020-03-11T11:18:41-06:00
    
    Recent Events:
    Time                  Type              Description
    2020-03-11T17:18:59Z  Started           Task started by client
    2020-03-11T17:18:57Z  Driver            Downloading image
    2020-03-11T17:18:41Z  Restarting        Task restarting in 16.008336604s
    2020-03-11T17:18:41Z  Terminated        Exit Code: 0
    2020-03-11T17:18:35Z  Restart Signaled  healthcheck: check "healthcheck" unhealthy
    2020-03-11T17:17:44Z  Started           Task started by client
    2020-03-11T17:17:40Z  Driver            Downloading image
    2020-03-11T17:17:40Z  Task Setup        Building Task Directory
    2020-03-11T17:17:34Z  Received          Task received by client
    
    Task "secondary" is "running"
    Task Resources
    CPU       Memory          Disk     Addresses
    0/48 MHz  1.2 MiB/32 MiB  300 MiB  secondary_port: 10.4.26.15:29470
    
    Task Events:
    Started At     = 2020-03-11T17:17:44Z
    Finished At    = N/A
    Total Restarts = 0
    Last Restart   = N/A
    
    Recent Events:
    Time                  Type        Description
    2020-03-11T17:17:44Z  Started     Task started by client
    2020-03-11T17:17:40Z  Driver      Downloading image
    2020-03-11T17:17:40Z  Task Setup  Building Task Directory
    2020-03-11T17:17:34Z  Received    Task received by client
    
  7. Nomad schedules a new allocation with the new job spec and tears down one of the old allocations, essentially continuing the deployment even though the healthchecks on the new allocs are still unhealthy. This is the behavior we're confused about:

    ID            = test-nomad-rolling-deployments
    Name          = test-nomad-rolling-deployments
    Submit Date   = 2020-03-11T17:17:33Z
    Type          = service
    Priority      = 50
    Datacenters   = a-dc
    Status        = running
    Periodic      = false
    Parameterized = false
    
    Summary
    Task Group                      Queued  Starting  Running  Failed  Complete  Lost
    test-nomad-rolling-deployments  0       0         3        0       2         0
    
    Latest Deployment
    ID          = d76e2a0e
    Status      = running
    Description = Deployment is running
    
    Deployed
    Task Group                      Auto Revert  Desired  Placed  Healthy  Unhealthy  Progress Deadline
    test-nomad-rolling-deployments  true         3        2       1        0          2020-03-11T11:34:46-06:00
    
    Allocations
    ID        Node ID   Task Group                      Version  Desired  Status    Created     Modified
    116e086b  cced5c02  test-nomad-rolling-deployments  1        run      running   32s ago     19s ago
    cf26ce6c  c38e9054  test-nomad-rolling-deployments  1        run      running   1m46s ago   20s ago
    1081ddfd  cced5c02  test-nomad-rolling-deployments  0        stop     complete  11m44s ago  1m40s ago
    a2759873  0faae4b3  test-nomad-rolling-deployments  0        stop     complete  11m44s ago  26s ago
    ba6b37bc  ed21263f  test-nomad-rolling-deployments  0        run      running   11m44s ago  11m24s ago
    
  8. This continues until all healthy allocs are gone, replaced by unhealthy ones (although Nomad incorrectly thinks they're healthy):

    ID            = test-nomad-rolling-deployments
    Name          = test-nomad-rolling-deployments
    Submit Date   = 2020-03-11T17:17:33Z
    Type          = service
    Priority      = 50
    Datacenters   = a-dc
    Status        = running
    Periodic      = false
    Parameterized = false
    
    Summary
    Task Group                      Queued  Starting  Running  Failed  Complete  Lost
    test-nomad-rolling-deployments  0       0         3        0       3         0
    
    Latest Deployment
    ID          = d76e2a0e
    Status      = running
    Description = Deployment is running
    
    Deployed
    Task Group                      Auto Revert  Desired  Placed  Healthy  Unhealthy  Progress Deadline
    test-nomad-rolling-deployments  true         3        3       2        0          2020-03-11T11:36:02-06:00
    
    Allocations
    ID        Node ID   Task Group                      Version  Desired  Status    Created     Modified
    1f4c25c1  f6f3bea0  test-nomad-rolling-deployments  1        run      running   52s ago     39s ago
    116e086b  cced5c02  test-nomad-rolling-deployments  1        run      running   2m8s ago    38s ago
    cf26ce6c  c38e9054  test-nomad-rolling-deployments  1        run      running   3m22s ago   40s ago
    1081ddfd  cced5c02  test-nomad-rolling-deployments  0        stop     complete  13m20s ago  3m16s ago
    a2759873  0faae4b3  test-nomad-rolling-deployments  0        stop     complete  13m20s ago  2m2s ago
    ba6b37bc  ed21263f  test-nomad-rolling-deployments  0        stop     complete  13m20s ago  46s ago
    

    State of new allocs after deployment completes:

    ID                  = 1f4c25c1
    Eval ID             = 9ac16764
    Name                = test-nomad-rolling-deployments.test-nomad-rolling-deployments[2]
    Node ID             = f6f3bea0
    Node Name           = a.node.tld
    Job ID              = test-nomad-rolling-deployments
    Job Version         = 824637845280
    Client Status       = running
    Client Description  = Tasks are running
    Desired Status      = run
    Desired Description = <none>
    Created             = 1m56s ago
    Modified            = 28s ago
    Deployment ID       = d76e2a0e
    Deployment Health   = healthy
    
    Task "main" is "running"
    Task Resources
    CPU       Memory          Disk     Addresses
    0/24 MHz  1.2 MiB/32 MiB  300 MiB  main_port: 10.4.22.187:21039
    
    Task Events:
    Started At     = 2020-03-11T17:21:31Z
    Finished At    = N/A
    Total Restarts = 1
    Last Restart   = 2020-03-11T11:21:12-06:00
    
    Recent Events:
    Time                  Type              Description
    2020-03-11T17:21:31Z  Started           Task started by client
    2020-03-11T17:21:29Z  Driver            Downloading image
    2020-03-11T17:21:12Z  Restarting        Task restarting in 17.009193127s
    2020-03-11T17:21:12Z  Terminated        Exit Code: 0
    2020-03-11T17:21:07Z  Restart Signaled  healthcheck: check "healthcheck" unhealthy
    2020-03-11T17:20:16Z  Started           Task started by client
    2020-03-11T17:20:09Z  Driver            Downloading image
    2020-03-11T17:20:09Z  Task Setup        Building Task Directory
    2020-03-11T17:20:03Z  Received          Task received by client
    
    Task "secondary" is "running"
    Task Resources
    CPU       Memory          Disk     Addresses
    0/48 MHz  1.2 MiB/32 MiB  300 MiB  secondary_port: 10.4.22.187:27842
    
    Task Events:
    Started At     = 2020-03-11T17:20:16Z
    Finished At    = N/A
    Total Restarts = 0
    Last Restart   = N/A
    
    Recent Events:
    Time                  Type        Description
    2020-03-11T17:20:16Z  Started     Task started by client
    2020-03-11T17:20:09Z  Driver      Downloading image
    2020-03-11T17:20:09Z  Task Setup  Building Task Directory
    2020-03-11T17:20:03Z  Received    Task received by client
    
    ID                  = 116e086b
    Eval ID             = 515b5394
    Name                = test-nomad-rolling-deployments.test-nomad-rolling-deployments[1]
    Node ID             = cced5c02
    Node Name           = a.node.tld
    Job ID              = test-nomad-rolling-deployments
    Job Version         = 824635642096
    Client Status       = running
    Client Description  = Tasks are running
    Desired Status      = run
    Desired Description = <none>
    Created             = 3m18s ago
    Modified            = 31s ago
    Deployment ID       = d76e2a0e
    Deployment Health   = healthy
    
    Task "main" is "running"
    Task Resources
    CPU       Memory          Disk     Addresses
    0/24 MHz  1.2 MiB/32 MiB  300 MiB  main_port: 10.4.26.52:27202
    
    Task Events:
    Started At     = 2020-03-11T17:21:34Z
    Finished At    = N/A
    Total Restarts = 2
    Last Restart   = 2020-03-11T11:21:15-06:00
    
    Recent Events:
    Time                  Type              Description
    2020-03-11T17:21:34Z  Started           Task started by client
    2020-03-11T17:21:32Z  Driver            Downloading image
    2020-03-11T17:21:15Z  Restarting        Task restarting in 17.237065933s
    2020-03-11T17:21:15Z  Terminated        Exit Code: 0
    2020-03-11T17:21:09Z  Restart Signaled  healthcheck: check "healthcheck" unhealthy
    2020-03-11T17:20:17Z  Started           Task started by client
    2020-03-11T17:20:15Z  Driver            Downloading image
    2020-03-11T17:19:57Z  Restarting        Task restarting in 18.136251856s
    2020-03-11T17:19:57Z  Terminated        Exit Code: 0
    2020-03-11T17:19:51Z  Restart Signaled  healthcheck: check "healthcheck" unhealthy
    
    Task "secondary" is "running"
    Task Resources
    CPU       Memory          Disk     Addresses
    0/48 MHz  1.2 MiB/32 MiB  300 MiB  secondary_port: 10.4.26.52:31300
    
    Task Events:
    Started At     = 2020-03-11T17:19:00Z
    Finished At    = N/A
    Total Restarts = 0
    Last Restart   = N/A
    
    Recent Events:
    Time                  Type        Description
    2020-03-11T17:19:00Z  Started     Task started by client
    2020-03-11T17:18:54Z  Driver      Downloading image
    2020-03-11T17:18:54Z  Task Setup  Building Task Directory
    2020-03-11T17:18:47Z  Received    Task received by client
    
    ID                  = cf26ce6c
    Eval ID             = 187a10ba
    Name                = test-nomad-rolling-deployments.test-nomad-rolling-deployments[0]
    Node ID             = c38e9054
    Node Name           = a.node.tld
    Job ID              = test-nomad-rolling-deployments
    Job Version         = 824635662768
    Client Status       = running
    Client Description  = Tasks are running
    Desired Status      = run
    Desired Description = <none>
    Created             = 4m36s ago
    Modified            = 39s ago
    Deployment ID       = d76e2a0e
    Deployment Health   = healthy
    
    Task "main" is "running"
    Task Resources
    CPU       Memory          Disk     Addresses
    0/24 MHz  1.2 MiB/32 MiB  300 MiB  main_port: 10.4.26.15:23739
    
    Task Events:
    Started At     = 2020-03-11T17:21:30Z
    Finished At    = N/A
    Total Restarts = 3
    Last Restart   = 2020-03-11T11:21:12-06:00
    
    Recent Events:
    Time                  Type              Description
    2020-03-11T17:21:30Z  Started           Task started by client
    2020-03-11T17:21:28Z  Driver            Downloading image
    2020-03-11T17:21:12Z  Restarting        Task restarting in 16.057916863s
    2020-03-11T17:21:12Z  Terminated        Exit Code: 0
    2020-03-11T17:21:07Z  Restart Signaled  healthcheck: check "healthcheck" unhealthy
    2020-03-11T17:20:15Z  Started           Task started by client
    2020-03-11T17:20:13Z  Driver            Downloading image
    2020-03-11T17:19:56Z  Restarting        Task restarting in 17.64927063s
    2020-03-11T17:19:56Z  Terminated        Exit Code: 0
    2020-03-11T17:19:50Z  Restart Signaled  healthcheck: check "healthcheck" unhealthy
    
    Task "secondary" is "running"
    Task Resources
    CPU       Memory          Disk     Addresses
    0/48 MHz  1.2 MiB/32 MiB  300 MiB  secondary_port: 10.4.26.15:29470
    
    Task Events:
    Started At     = 2020-03-11T17:17:44Z
    Finished At    = N/A
    Total Restarts = 0
    Last Restart   = N/A
    
    Recent Events:
    Time                  Type        Description
    2020-03-11T17:17:44Z  Started     Task started by client
    2020-03-11T17:17:40Z  Driver      Downloading image
    2020-03-11T17:17:40Z  Task Setup  Building Task Directory
    2020-03-11T17:17:34Z  Received    Task received by client
    

    Final state of deployment

    ID          = d76e2a0e
    Job ID      = test-nomad-rolling-deployments
    Job Version = 1
    Status      = successful
    Description = Deployment completed successfully
    
    Deployed
    Task Group                      Auto Revert  Desired  Placed  Healthy  Unhealthy  Progress Deadline
    test-nomad-rolling-deployments  true         3        3       3        0          2020-03-11T11:37:17-06:00
    

Job file (if appropriate)

{
    "Job": {
        "ID": "test-nomad-rolling-deployments",
        "Name": "test-nomad-rolling-deployments",
        "Type": "service",
        "Priority": 50,
        "Region": "a-region",
        "DataCenters": [
            "a-dc"
        ],
        "TaskGroups": [
            {
                "Count": 3,
                "Tasks": [
                    {
                        "Driver": "docker",
                        "Resources": {
                            "CPU": 48,
                            "MemoryMB": 32,
                            "Networks": [
                                {
                                    "DynamicPorts": [
                                        {
                                            "label": "main_port"
                                        }
                                    ]
                                }
                            ]
                        },
                        "Services": [
                            {
                                "PortLabel": "main_port",
                                "Checks": [
                                    {
                                        "Type": "http",
                                        "Interval": 10000000000,
                                        "Timeout": 5000000000,
                                        "CheckRestart": {
                                            "Limit": 3,
                                            "Grace": 30000000000
                                        },
                                        "Path": "/healthcheck-ok",
                                        "Name": "healthcheck"
                                    }
                                ],
                                "Name": "test-rolling-restart-service-main"
                            }
                        ],
                        "ShutdownDelay": 5000000000,
                        "Templates": [
                            {
                                "ChangeSignal": "SIGHUP",
                                "DestPath": "local/nginx.conf",
                                "Perms": "0644",
                                "ChangeMode": "signal",
                                "EmbeddedTmpl": "events {}\n\nhttp {\n  server {\n    location /healthcheck-ok {\n      return 200 'OK';\n      add_header Content-Type text/plain;\n    }\n\n    location /healthcheck-not-ok {\n      return 500 'NOT OK';\n      add_header Content-Type text/plain;\n    }\n  }\n}\n"
                            }
                        ],
                        "Config": {
                            "force_pull": true,
                            "image": "nginx:latest",
                            "port_map": [
                                {
                                    "main_port": 80
                                }
                            ],
                            "volumes": [
                                "local:/etc/nginx"
                            ]
                        },
                        "KillTimeout": 15000000000,
                        "Name": "main"
                    },
                    {
                        "Driver": "docker",
                        "Resources": {
                            "CPU": 48,
                            "MemoryMB": 32,
                            "Networks": [
                                {
                                    "DynamicPorts": [
                                        {
                                            "label": "secondary_port"
                                        }
                                    ]
                                }
                            ]
                        },
                        "Services": [
                            {
                                "PortLabel": "secondary_port",
                                "Checks": [
                                    {
                                        "Type": "http",
                                        "Interval": 10000000000,
                                        "Timeout": 5000000000,
                                        "CheckRestart": {
                                            "Limit": 3,
                                            "Grace": 180000000000
                                        },
                                        "Path": "/healthcheck-ok",
                                        "Name": "healthcheck"
                                    }
                                ],
                                "Name": "test-rolling-restart-service-secondary"
                            }
                        ],
                        "ShutdownDelay": 5000000000,
                        "Templates": [
                            {
                                "ChangeSignal": "SIGHUP",
                                "DestPath": "local/nginx.conf",
                                "Perms": "0644",
                                "ChangeMode": "signal",
                                "EmbeddedTmpl": "events {}\n\nhttp {\n  server {\n    location /healthcheck-ok {\n      return 200 'OK';\n      add_header Content-Type text/plain;\n    }\n\n    location /healthcheck-not-ok {\n      return 500 'NOT OK';\n      add_header Content-Type text/plain;\n    }\n  }\n}\n"
                            }
                        ],
                        "Config": {
                            "force_pull": true,
                            "image": "nginx:latest",
                            "port_map": [
                                {
                                    "secondary_port": 80
                                }
                            ],
                            "volumes": [
                                "local:/etc/nginx"
                            ]
                        },
                        "KillTimeout": 15000000000,
                        "Name": "secondary"
                    }
                ],
                "RestartPolicy": {
                    "Attempts": 3,
                    "Delay": 15000000000,
                    "Interval": 180000000000,
                    "Mode": "fail"
                },
                "Update": {
                    "MaxParallel": 1,
                    "AutoRevert": true,
                    "HealthCheck": "checks",
                    "ProgressDeadline": 960000000000,
                    "HealthyDeadline": 900000000000,
                    "MinHealthyTime": 10000000000,
                    "Stagger": 10000000000
                },
                "Name": "test-nomad-rolling-deployments"
            }
        ]
    }
}

I've left off other logs as I think the repro steps are sufficient and this reproduces 100% of the time in our setup, but happy to gather some if necessary.


djenriquez commented Mar 12, 2020

We have also seen this problem in our testing with v0.10.4. We don't fully understand the situation, but it's definitely a problem: at the moment, no deployment will ever fail due to health checks.

We have a job whose task will never get healthy, yet for some reason, Nomad always passes the deployment. Consul properly reports the task as unhealthy.

Another big problem related to this is that the restart only works once. We've replicated this behavior a few times now: check_restart does its job by triggering the first restart, and then nothing. It seems that after an allocation has been restarted once, check_restart is unable to trigger the restart policy again.

We can mitigate this by setting the restart policy attempts to 0 with mode: fail, which always forces a reschedule on the first failure. However, this points to another problematic scenario we cannot confirm: if a healthy task suddenly goes unhealthy, does the check_restart policy go into effect?

During testing, we set the task's check_restart to grace: 0, limit: 3, with a check interval of 10s. This reliably triggers the first restart after about 20s (the first check appears to happen immediately, followed by two more checks 10s apart).
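
A rough sketch of what we're testing, in JSON API terms (the RestartPolicy sits at the task group level and the CheckRestart on the service check; values as described above, durations in nanoseconds, unrelated fields omitted):

    "RestartPolicy": {
        "Attempts": 0,
        "Mode": "fail"
    },
    "Checks": [
        {
            "Type": "http",
            "Interval": 10000000000,
            "CheckRestart": {
                "Grace": 0,
                "Limit": 3
            }
        }
    ]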

Solid report @dpn, this regression seems pretty dire.

Thank you @kainoaseto and @tydomitrovich for finding and helping to troubleshoot.


djenriquez commented Mar 12, 2020

Hi guys, apologies if I'm inflating the priority for this issue, but it seems pretty serious that we cannot depend on health checks of allocations during deployments.

Could we get confirmation that this issue has been acknowledged and is being prioritized (hopefully on the higher side)?

@tgross @drewbailey @dadgar ?


notnoop commented Mar 13, 2020

@dpn @djenriquez This seems very bad indeed. I'll be investigating this now and will post updates when I get an understanding of the underlying issue and if there are any mitigating factors. Thank you very much for the detailed and clear reproducibility steps.

@notnoop notnoop self-assigned this Mar 13, 2020
@notnoop notnoop added this to Needs Triage in Nomad - Community Issues Triage via automation Mar 13, 2020
@notnoop notnoop moved this from Needs Triage to In Progress in Nomad - Community Issues Triage Mar 13, 2020

notnoop commented Mar 16, 2020

Thanks again for the issue. It's indeed very serious: it affects virtually all deployments, and it affects Nomad versions as old as 0.8.0 and, I believe, earlier.

It affects deployments where min_healthy_time is less than the restart delay. While the task is being restarted, the Nomad client may consider it healthy!

One workaround is to increase min_healthy_time to be higher than possible restart delays.
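
For the repro job in this issue, which sets MinHealthyTime to 10s and the RestartPolicy Delay to 15s, that would mean bumping MinHealthyTime above the restart delay, e.g. (durations in nanoseconds; 30s is just an illustrative value comfortably above the 15s delay):

    "Update": {
        "MaxParallel": 1,
        "AutoRevert": true,
        "HealthCheck": "checks",
        "ProgressDeadline": 960000000000,
        "HealthyDeadline": 900000000000,
        "MinHealthyTime": 30000000000,
        "Stagger": 10000000000
    }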

I'm working on the fix and aim to have it ready later this week.


dpn commented Mar 16, 2020

Thanks @notnoop, really appreciate you digging into this. Do you think this will be backported to the 0.9 and 0.10 series of releases? I know we're lagging behind by being on 0.9 but we'll be finishing up our 0.10 validation soon and plan to migrate over once that's complete.

@kainoaseto

Thank you @notnoop for looking into this and for the workaround in the meantime! I will look at implementing that fix in our jobs for our 0.10 clusters to mitigate this bug and will watch for the fix later this week.

Nomad - Community Issues Triage automation moved this from In Progress to Done Mar 25, 2020
@kainoaseto

Hi @notnoop and anyone else who runs into this before the fix is released in 0.11.0. I tested the mitigation by changing the Restart.Delay to be < min_healthy_time as suggested, and was able to:

  • have allocations fail during deployments from health checks failing
  • have allocations fail and reschedule from health checks failing

Thanks for the workaround!

Below is some sample configuration in case anyone else runs into the same thing:

All at the task group level:

    "ReschedulePolicy": {
        "Attempts": 0,
        "Delay": 15000000000,
        "DelayFunction": "exponential",
        "Interval": 0,
        "MaxDelay": 60000000000,
        "Unlimited": true
      },
      "RestartPolicy": {
        "Attempts": 0,
        "Delay": 15000000000,
        "Interval": 1800000000000,
        "Mode": "fail"
      },
      "Services": [
        {
          "AddressMode": "auto",
          "Checks": [
            {
              "AddressMode": "",
              "CheckRestart": {
                "Grace": 10000000000,
                "IgnoreWarnings": false,
                "Limit": 3
              },
              "Command": "",
              "GRPCService": "",
              "GRPCUseTLS": false,
              "InitialStatus": "warning",
              "Interval": 10000000000,
              "Method": "GET",
              "Name": "healthy",
              "Path": "/healthcheck",
              "PortLabel": "my-service",
              "Protocol": "",
              "TLSSkipVerify": false,
              "TaskName": "",
              "Timeout": 5000000000,
              "Type": "http"
            }
          ],
    .
    .
    .
    "Update": {
        "AutoPromote": false,
        "AutoRevert": true,
        "Canary": 0,
        "HealthCheck": "checks",
        "HealthyDeadline": 300000000000,
        "MaxParallel": 1,
        "MinHealthyTime": 200000000000,
        "ProgressDeadline": 600000000000,
        "Stagger": 30000000000
      },
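
The key relationship here is that MinHealthyTime (200000000000 ns = 200s) is comfortably above the RestartPolicy Delay (15000000000 ns = 15s), per the workaround above.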


dpn commented Mar 28, 2020

Thanks @notnoop for the quick fix!

@Laboltus

I experience the same behavior with 0.11.3. Nomad does not wait until the current allocs become healthy before restarting the next ones.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 25, 2022