
v0.10.4 - Shutdown Delay not working as expected #7251

Closed
djenriquez opened this issue Mar 2, 2020 · 15 comments

Comments

@djenriquez

djenriquez commented Mar 2, 2020

Nomad version

Server and Clients both on v0.10.4.

Operating system and Environment details

Issue

ShutdownDelay, as defined in the Nomad docs, is not being honored during deployments. Below is a job file that defines two tasks within a task group, each with a different ShutdownDelay. During deployment, however, the delays do not appear to be honored, based on two pieces of evidence:

  1. We have an error that occurs when our proxy is terminated before our app, and we continue to see it.
  2. Nomad's timestamps for allocation state do not coincide with what is expected given the shutdown delays:
    Here we have two allocations representing a deployment. Version 18 was created at 13:28:51, so I would expect the task with the 60s ShutdownDelay not to exit until 13:29:51 at the earliest.
    [Screenshot: Screen Shot 2020-03-02 at 1 30 06 PM]
    However, if we look at the allocation (this one is the proxy specifically, which has the 60s shutdown delay):
    [Screenshot: Screen Shot 2020-03-02 at 1 30 37 PM]
    This shows that the task received the shutdown signal 10s after version 18 was created, meaning it either completely ignored the task's shutdown delay or used the app's shutdown delay, which was 10s.
    Now here's the app's task in that same allocation:
    [Screenshot: Screen Shot 2020-03-02 at 1 31 06 PM]
    This should not be happening if the tasks' shutdown delays are truly honored.

Also worth mentioning: when the task group's shutdown delay is updated, job plans do not detect the change.
When changing the task group's ShutdownDelay from null to 60000000000, the plan shows this:
[Screenshot: Screen Shot 2020-03-02 at 1 48 25 PM]

Lastly, even though the tasks have shutdown delays, they seem to be completely ignored until the group's shutdown delay is defined: allocations were getting their shutdown signal immediately. Only after adding a 10s shutdown delay to the group did I notice any delay, and it did not trickle down to the tasks. Reverting the group's shutdown delay back to null still left the allocations with the 10s delay.

Reproduction steps

  1. Create a job with a network-namespaced task group.
  2. Set the task group's shutdown delay to null.
  3. Assign shutdown delays to the tasks.
  4. Force a job update (I do this by changing values in the task group's Meta map).
  5. Watch the shutdown delays be ignored.
  6. Update the task group's shutdown delay to some value (this appears to be a no-op deployment).
  7. Force another deployment so that the shutdown delay is part of the allocation's definition.
  8. Watch the task group's shutdown delay seem to be honored.
  9. Revert the task group's shutdown delay to null (another update deployment).
  10. Force another deployment and watch the shutdown delay still seem to apply for the group.

Throughout all of this, watch as the tasks' shutdown delays are never honored.
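
For reference, here is a minimal HCL sketch of the layout these steps describe (the job name, driver, images, and meta key are placeholders, not our real job):

job "repro" {
  datacenters = ["dc1"]

  group "app" {
    # Step 2: group-level shutdown_delay deliberately left unset (null).

    network {
      mode = "bridge" # step 1: network-namespaced task group
    }

    meta {
      force_update = "1" # step 4: bump this value to force a new deployment
    }

    task "app" {
      driver         = "docker"
      shutdown_delay = "10s" # step 3

      config {
        image = "example/app:latest"
      }
    }

    task "proxy" {
      driver         = "docker"
      shutdown_delay = "60s" # step 3

      config {
        image = "example/proxy:latest"
      }
    }
  }
}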

Job file (if appropriate)

"TaskGroups": [
    {
      "Affinities": null,
      "Constraints": [
        {
          "LTarget": "${node.class}",
          "Operand": "=",
          "RTarget": "app"
        },
        {
          "LTarget": "${attr.vault.version}",
          "Operand": "semver",
          "RTarget": ">= 0.6.1"
        }
      ],
      "Count": 2,
      "EphemeralDisk": {
        "Migrate": false,
        "SizeMB": 300,
        "Sticky": false
      },
      "Meta": {},
      "Migrate": {
        "HealthCheck": "checks",
        "HealthyDeadline": 300000000000,
        "MaxParallel": 1,
        "MinHealthyTime": 10000000000
      },
      "Name": "<REDACTED>-app",
      "Networks": [
        {
          "CIDR": "",
          "Device": "",
          "DynamicPorts": [
            {
              "Label": "<REDACTED>-proxy",
              "To": 81,
              "Value": 0
            },
            {
              "Label": "<REDACTED>-admin",
              "To": 10021,
              "Value": 0
            }
          ],
          "IP": "",
          "MBits": 100,
          "Mode": "bridge",
          "ReservedPorts": null
        }
      ],
      "ReschedulePolicy": {
        "Attempts": 0,
        "Delay": 15000000000,
        "DelayFunction": "exponential",
        "Interval": 0,
        "MaxDelay": 60000000000,
        "Unlimited": true
      },
      "RestartPolicy": {
        "Attempts": 0,
        "Delay": 60000000000,
        "Interval": 180000000000,
        "Mode": "fail"
      },
      "Services": [
        {
          "AddressMode": "auto",
          "CanaryMeta": null,
          "CanaryTags": null,
          "Checks": [
            {
              "AddressMode": "",
              "Args": null,
              "CheckRestart": {
                "Grace": 60000000000,
                "IgnoreWarnings": false,
                "Limit": 3
              },
              "Command": "",
              "GRPCService": "",
              "GRPCUseTLS": false,
              "Header": null,
              "InitialStatus": "warning",
              "Interval": 20000000000,
              "Method": "",
              "Name": "alive",
              "Path": "",
              "PortLabel": "<REDACTED>-proxy",
              "Protocol": "",
              "TLSSkipVerify": false,
              "TaskName": "",
              "Timeout": 3000000000,
              "Type": "tcp"
            },
            {
              "AddressMode": "",
              "Args": null,
              "CheckRestart": {
                "Grace": 60000000000,
                "IgnoreWarnings": false,
                "Limit": 3
              },
              "Command": "",
              "GRPCService": "",
              "GRPCUseTLS": false,
              "Header": {
                "Host": [
                  "<REDACTED>"
                ]
              },
              "InitialStatus": "warning",
              "Interval": 60000000000,
              "Method": "GET",
              "Name": "available",
              "Path": "/healthcheck",
              "PortLabel": "<REDACTED>-admin",
              "Protocol": "",
              "TLSSkipVerify": false,
              "TaskName": "",
              "Timeout": 45000000000,
              "Type": "http"
            }
          ],
          "Connect": null,
          "Meta": {},
          "Name": "<REDACTED>",
          "PortLabel": "<REDACTED>-proxy",
          "Tags": [<REDACTED>]
        },
        {
          "AddressMode": "auto",
          "CanaryMeta": null,
          "CanaryTags": null,
          "Checks": [
            {
              "AddressMode": "",
              "Args": null,
              "CheckRestart": null,
              "Command": "",
              "GRPCService": "",
              "GRPCUseTLS": false,
              "Header": null,
              "InitialStatus": "warning",
              "Interval": 20000000000,
              "Method": "GET",
              "Name": "admin-healthcheck",
              "Path": "/healthcheck",
              "PortLabel": "<REDACTED>-admin",
              "Protocol": "",
              "TLSSkipVerify": false,
              "TaskName": "",
              "Timeout": 3000000000,
              "Type": "http"
            }
          ],
          "Connect": null,
          "Meta": {},
          "Name": "<REDACTED>-admin",
          "PortLabel": "<REDACTED>-admin",
          "Tags": [<REDACTED>]
        }
      ],
      "ShutdownDelay": null,
      "Spreads": [
        {
          "Attribute": "${attr.unique.network.ip-address}",
          "SpreadTarget": null,
          "Weight": 50
        }
      ],
      "Tasks": [
        {
          "Affinities": null,
          "Artifacts": null,
          "Config": {
            "ulimit": [
              {
                "nofile": "100000:100000"
              }
            ],
            "force_pull": true,
            "image": "registry.<REDACTED>.net/<REDACTED>:797a8ae77057f707dfe50a1d5509d1a38cda4b89",
            "logging": [
              {
                "config": [
                  {
                    "syslog-format": "rfc5424micro",
                    "tag": "<REDACTED>_<REDACTED>_registry.<REDACTED>.net/<REDACTED>:<REDACTED>${NOMAD_ALLOC_ID}_{{.ID}}",
                    "syslog-address": "udp://${attr.unique.network.ip-address}:514"
                  }
                ],
                "driver": "syslog"
              }
            ]
          },
          "Constraints": null,
          "DispatchPayload": null,
          "Driver": "docker",
          "Env": {
            <REDACTED>
          },
          "KillSignal": "",
          "KillTimeout": 1000000000,
          "Kind": "",
          "Leader": false,
          "LogConfig": {
            "MaxFileSizeMB": 10,
            "MaxFiles": 10
          },
          "Meta": {
            "Purpose": "service",
            "StackServiceName": "<REDACTED>"
          },
          "Name": "<REDACTED>-<REDACTED>",
          "Resources": {
            "CPU": 1000,
            "Devices": null,
            "DiskMB": 0,
            "IOPS": 0,
            "MemoryMB": 2000,
            "Networks": null
          },
          "Services": null,
          "ShutdownDelay": 10000000000,
          "Templates": [
            {
              "ChangeMode": "restart",
              "ChangeSignal": "",
              "DestPath": "secrets/rendered.env",
              "EmbeddedTmpl": "<REDACTED>",
              "Envvars": true,
              "LeftDelim": "{{",
              "Perms": "0644",
              "RightDelim": "}}",
              "SourcePath": "",
              "Splay": 5000000000,
              "VaultGrace": 15000000000
            }
          ],
          "User": "",
          "Vault": {
            "ChangeMode": "restart",
            "ChangeSignal": "SIGHUP",
            "Env": true,
            "Policies": [
              "<REDACTED>-<REDACTED>-<REDACTED>"
            ]
          },
          "VolumeMounts": null
        },
        {
          "Affinities": null,
          "Artifacts": null,
          "Config": {
            "ulimit": [
              {
                "nofile": "100000:100000"
              }
            ],
            "args": [
              "-l",
              "warn"
            ],
            "image": "registry.<REDACTED>/envoy-sidecar:v0.10.0"
          },
          "Constraints": null,
          "DispatchPayload": null,
          "Driver": "docker",
          "Env": {
            "SERVICE_PORT": "10020",
            "XDS_ADDRESS": "${attr.unique.network.ip-address}",
            "LISTENER_PROTOCOL": "http",
            "NAMESPACE": "<REDACTED>",
            "NODE_ID": "<REDACTED>-${NOMAD_ALLOC_ID}-proxy",
            "SERVICE_NAME": "<REDACTED>"
          },
          "KillSignal": "",
          "KillTimeout": 30000000000,
          "Kind": "",
          "Leader": true,
          "LogConfig": {
            "MaxFileSizeMB": 10,
            "MaxFiles": 10
          },
          "Meta": {
            "purpose": "proxy"
          },
          "Name": "<REDACTED>-<REDACTED>-proxy",
          "Resources": {
            "CPU": 250,
            "Devices": null,
            "DiskMB": 0,
            "IOPS": 0,
            "MemoryMB": 250,
            "Networks": null
          },
          "Services": null,
          "ShutdownDelay": 60000000000,
          "Templates": null,
          "User": "",
          "Vault": null,
          "VolumeMounts": null
        }
      ],
      "Update": {
        "AutoPromote": false,
        "AutoRevert": true,
        "Canary": 0,
        "HealthCheck": "checks",
        "HealthyDeadline": 200000000000,
        "MaxParallel": 1,
        "MinHealthyTime": 10000000000,
        "ProgressDeadline": 600000000000,
        "Stagger": 30000000000
      },
      "Volumes": null
    }
]
@djenriquez
Author

@drewbailey this is related to the work in #6746. I don't think I'm missing anything in my job config, but it does not seem to be working as expected, and I'm not sure why.

@drewbailey
Contributor

drewbailey commented Mar 4, 2020

Hey @djenriquez, thanks for reporting. There seem to be a few things going on, so I wanted to share a reproduction job file to discuss the different scenarios.

repro.hcl
job "delay" {
  datacenters = ["dc1"]

  group "api" {
    # shutdown_delay = "30s"

    meta {
      foo = "bar"
    }

    service {
      tags = ["group"]
    }

    spread {
      attribute = "${node.datacenter}"

      target "dc1" {
        percent = 100
      }
    }

    task "web" {
      driver = "docker"

      # shutdown_delay = "10s"

      config {
        image = "hashicorpnomad/counter-api:v1"
      }
      service {
        tags = ["awesome"]
      }
    }

    task "dashboard" {
      driver = "docker"

      shutdown_delay = "60s"

      env {
        COUNTING_SERVICE_URL = "http://${NOMAD_UPSTREAM_ADDR_count_api}"
      }

      config {
        image = "hashicorpnomad/counter-dashboard:v1"
      }

      service {
        tags = ["dashboard-service"]
      }
    }
  }
}
  1. #7265 (Changes to Task Group shutdown_delay not reflected in job plan output) has been created to fix that problem.
  2. Currently for tasks, shutdown_delay applies to a service, so it will not be registered if a service doesn't exist. The serviceHook handles waiting for the delay in its preKilling step (see the sketch after this list).
  3. I'm investigating the deployments aspect further and will come back with more info.
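
To illustrate point 2, here is a minimal job-spec sketch (names borrowed from the repro above) of the only shape under which a task-level delay currently takes effect: the task itself registers a service. Without the task-level service block, the serviceHook never runs and the delay is skipped.

task "web" {
  driver = "docker"

  # Only honored today because a task-level service is registered below.
  shutdown_delay = "10s"

  config {
    image = "hashicorpnomad/counter-api:v1"
  }

  service {
    tags = ["awesome"]
  }
}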

edit:
I haven't been able to reproduce the deployment issue; it seems to be waiting on the task's shutdown_delay, and the new alloc will wait in pending until the time has elapsed. Could you confirm that the shutdown_delays work as expected as long as a service is also being registered?

edit 2:
Regarding bullet 2 and shutdown_delay being tied to service registration: we will be treating this as a bug and will allow shutdown_delay to run regardless of service registration, since the two are not explicitly tied together in the job spec.

@djenriquez
Author

djenriquez commented Mar 4, 2020

Hi @drewbailey, yes, sorry, there might have been some confusion in my reporting. The shutdown_delay does go into effect if the allocation's definition defines one at the task group level. I was expecting the task's shutdown_delay of 60s to go into effect, which is why I reported that during the deployment the shutdown signal was sent after only 10s and not 60s (60s was the task shutdown delay, 10s was the task group shutdown delay).

However, you clarified that shutdown_delay only goes into effect IF the task has a service. Since we register services at the taskgroup level and not the task level (because of network namespacing), the task's shutdown_delay will never go into effect. Is this a true statement?

Does your edit no. 2 allow for tasks' shutdown_delay to go into effect regardless of service registration?

@djenriquez
Author

djenriquez commented Mar 4, 2020

Interestingly, I'm looking at your repro.hcl and I see that both your task and task group have service stanzas defined. I didn't realize you can define services at both the task group and task level. How would you define checks for tasks in this case? Checks require a PortLabel, which is not available in the context of a task if a network namespace is defined, right?

@danlsgiga
Contributor

> 2. Currently for tasks, `shutdown_delay` applies to a service, so it will [not be registered if a service doesn't exist](https://github.com/hashicorp/nomad/blob/3284a34b4289d449970fa510d66c14135277eb66/client/allocrunner/taskrunner/task_runner_hooks.go#L99). The serviceHook handles waiting for the delay in its [preKilling step](https://github.com/hashicorp/nomad/blob/3284a34b4289d449970fa510d66c14135277eb66/client/allocrunner/taskrunner/service_hook.go#L134).

It would be great to have shutdown_delay honoured for any type of task. My use case is batch jobs that run really fast, alongside a filebeat sidecar that ships the logs to logstash. I have shutdown_delay set on the filebeat task to give it some time to read and push the logs, but that does not happen and filebeat is killed immediately after the lead task finishes.
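
For context, a minimal sketch of the kind of job described above (the image names, versions, and delay value are assumptions):

job "fast-batch" {
  datacenters = ["dc1"]
  type        = "batch"

  group "work" {
    task "main" {
      driver = "docker"
      leader = true # when this lead task finishes, the sidecar is shut down

      config {
        image = "example/fast-batch:latest"
      }
    }

    task "filebeat" {
      driver = "docker"

      # Intended to give filebeat time to ship remaining logs to logstash,
      # but currently ignored because this task registers no service.
      shutdown_delay = "30s"

      config {
        image = "docker.elastic.co/beats/filebeat:7.6.0"
      }
    }
  }
}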

@djenriquez
Author

djenriquez commented Mar 4, 2020

@danlsgiga a really, really bad hack for right now would be to set the kill signal to something that won't terminate the process, then use the kill_timeout as your actual shutdown. Yeah, I know, but it'll work.

Edit: Nevermind, you said batch job. So that'll run to completion by itself.
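
For reference, a rough sketch of that hack (the signal choice is an assumption; it only works if the process actually ignores that signal):

task "proxy" {
  driver = "docker"

  # A signal the process ignores, so the initial "kill" is a no-op...
  kill_signal = "SIGUSR1"

  # ...and the real shutdown happens when kill_timeout expires and Nomad
  # escalates to SIGKILL.
  kill_timeout = "60s"

  config {
    image = "example/proxy:latest"
  }
}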

@danlsgiga
Contributor

yeah, didn't think about that option. Thanks for that... but I prefer to wait for the fix because... you know... if I set that hack, it will be there forever 😄

@ryan-shaw

Also having the same issue. Does shutdown_delay only apply to service-type jobs? I have a system job where the shutdown_delay is not being applied.

@stale

stale bot commented Jul 5, 2020

Hey there

Since this issue hasn't had any activity in a while - we're going to automatically close it in 30 days. If you're still seeing this issue with the latest version of Nomad, please respond here and we'll keep this open and take another look at this.

Thanks!

@tgross tgross added this to Needs Roadmapping in Nomad - Community Issues Triage Feb 12, 2021
@tgross tgross removed this from Needs Roadmapping in Nomad - Community Issues Triage Mar 4, 2021
@tgross tgross added this to Needs Triage in Nomad - Community Issues Triage via automation Mar 4, 2021
@tgross tgross moved this from Needs Triage to Needs Roadmapping in Nomad - Community Issues Triage Mar 4, 2021
@drewbailey drewbailey removed their assignment May 17, 2021
@sashayakovtseva

sashayakovtseva commented Nov 9, 2021

Having the same issue.
Nomad version: v1.1.2
Job type is service. Deploying with canaries (the canary count matching the group count), shutdown_delay at the task level is set to 2 minutes, but right after canary promotion I see allocs in a completed state.

UPD: shutdown_delay at the group level works, though.

@sashayakovtseva

UPD: shutdown_delay at the group level does not work as expected. After canary promotion, instead of:

  1. deregistering services from Consul
  2. waiting shutdown_delay
  3. sending the shutdown signal

I get:

  1. waiting shutdown_delay
  2. deregistering services from Consul
  3. sending the shutdown signal

That is a huge disappointment, since shutdown_delay is vital for us to let external LBs update their configs via consul-template.

@sashayakovtseva

@tgross Sorry to bother you, but is anything happening with this issue? I am willing to help with any investigation if needed.

@tgross
Member

tgross commented Feb 10, 2022

@sashayakovtseva this issue probably should have been closed when #7663 was merged for 0.11.1.

What you're describing isn't what this ticket is about (shutdown_delay not being respected at all unless there were also service blocks), so that's why your request got a bit lost. Can you open a new issue describing your problem, along with a reproduction if possible?

@tgross tgross closed this as completed Feb 10, 2022
Nomad - Community Issues Triage automation moved this from Needs Roadmapping to Done Feb 10, 2022
@sashayakovtseva

I've double-checked this issue again. Task-level shutdown_delay works; group-level shutdown_delay does not. But this is not a problem for me anymore; I got the behaviour I needed. Thanks, and sorry again for my mistake.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 11, 2022