
v0.10.4 - Shutdown Delay not working as expected #7251

Closed
djenriquez opened this issue Mar 2, 2020 · 15 comments

Comments

@djenriquez

djenriquez commented Mar 2, 2020

Nomad version

Server and Clients both on v0.10.4.

Operating system and Environment details

Issue

ShutdownDelay, as defined in the Nomad docs, is not being honored during deployments. Below is a job file that defines two tasks within a task group, each with a different ShutdownDelay. During deployment, however, the delays do not appear to be honored, based on two pieces of evidence:

  1. We have an error that occurs when our proxy is terminated before our app, and we continue to see it.
  2. Nomad's timestamps for allocation state do not coincide with what is expected given the shutdown delays:
    Here we have two allocations representing a deployment. Version 18 was created at 13:28:51, so I would expect the task with the 60s ShutdownDelay not to exit until 13:29:51 at the earliest.
    [Screenshot: Screen Shot 2020-03-02 at 1 30 06 PM]
    However, if we look at the allocation (this one is the proxy specifically, which has the 60s shutdown delay):
    [Screenshot: Screen Shot 2020-03-02 at 1 30 37 PM]
    This shows that the task received the shutdown signal 10s after version 18 was created, meaning it either completely ignored the task's shutdown delay or used the app's shutdown delay, which was 10s.
    Now here's the app's task in that same allocation:
    [Screenshot: Screen Shot 2020-03-02 at 1 31 06 PM]
    This should not be happening if the tasks' shutdown delays are truly honored.

Also worth mentioning: when the task group's shutdown delay is updated, job plans do not detect the change.
When changing the task group's ShutdownDelay from null to 60000000000, the plan shows this:
[Screenshot: Screen Shot 2020-03-02 at 1 48 25 PM]

Lastly, even though the tasks have shutdown delays, they seem to be completely ignored until the group's shutdown delay is defined: allocations were getting their shutdown signal immediately. Only after adding a 10s shutdown delay to the group did I notice any delay, and it did not trickle down to the tasks. Reverting the group's shutdown delay back to null still left the allocations with the 10s delay.

Reproduction steps

  1. Create a job with a network-namespaced task group.
  2. Set the task group's shutdown delay to null.
  3. Assign shutdown delays to the tasks.
  4. Force a job update (I do this by changing values in the task group's Meta map).
  5. Watch the shutdown delays be ignored.
  6. Update the task group's shutdown delay to some value (this appears to be a no-op deployment).
  7. Force another deployment so that the shutdown delay is part of the allocation's definition.
  8. Watch the task group's shutdown delay seem to be honored.
  9. Revert the task group's shutdown delay to null (another update deployment).
  10. Force another deployment and watch the shutdown delay still seem to apply for the group.

Throughout all of this, watch as the tasks' shutdown delays are never honored.
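
For reference, here is a minimal HCL sketch of the layout these steps describe (the job name, driver, images, and meta key are placeholders, not our real job):

job "repro" {
  datacenters = ["dc1"]

  group "app" {
    # Step 2: group-level shutdown_delay deliberately left unset (null).

    network {
      mode = "bridge" # step 1: network-namespaced task group
    }

    meta {
      force_update = "1" # step 4: bump this value to force a new deployment
    }

    task "app" {
      driver         = "docker"
      shutdown_delay = "10s" # step 3

      config {
        image = "example/app:latest"
      }
    }

    task "proxy" {
      driver         = "docker"
      shutdown_delay = "60s" # step 3

      config {
        image = "example/proxy:latest"
      }
    }
  }
}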

Job file (if appropriate)

"TaskGroups": [
    {
      "Affinities": null,
      "Constraints": [
        {
          "LTarget": "${node.class}",
          "Operand": "=",
          "RTarget": "app"
        },
        {
          "LTarget": "${attr.vault.version}",
          "Operand": "semver",
          "RTarget": ">= 0.6.1"
        }
      ],
      "Count": 2,
      "EphemeralDisk": {
        "Migrate": false,
        "SizeMB": 300,
        "Sticky": false
      },
      "Meta": {},
      "Migrate": {
        "HealthCheck": "checks",
        "HealthyDeadline": 300000000000,
        "MaxParallel": 1,
        "MinHealthyTime": 10000000000
      },
      "Name": "<REDACTED>-app",
      "Networks": [
        {
          "CIDR": "",
          "Device": "",
          "DynamicPorts": [
            {
              "Label": "<REDACTED>-proxy",
              "To": 81,
              "Value": 0
            },
            {
              "Label": "<REDACTED>-admin",
              "To": 10021,
              "Value": 0
            }
          ],
          "IP": "",
          "MBits": 100,
          "Mode": "bridge",
          "ReservedPorts": null
        }
      ],
      "ReschedulePolicy": {
        "Attempts": 0,
        "Delay": 15000000000,
        "DelayFunction": "exponential",
        "Interval": 0,
        "MaxDelay": 60000000000,
        "Unlimited": true
      },
      "RestartPolicy": {
        "Attempts": 0,
        "Delay": 60000000000,
        "Interval": 180000000000,
        "Mode": "fail"
      },
      "Services": [
        {
          "AddressMode": "auto",
          "CanaryMeta": null,
          "CanaryTags": null,
          "Checks": [
            {
              "AddressMode": "",
              "Args": null,
              "CheckRestart": {
                "Grace": 60000000000,
                "IgnoreWarnings": false,
                "Limit": 3
              },
              "Command": "",
              "GRPCService": "",
              "GRPCUseTLS": false,
              "Header": null,
              "InitialStatus": "warning",
              "Interval": 20000000000,
              "Method": "",
              "Name": "alive",
              "Path": "",
              "PortLabel": "<REDACTED>-proxy",
              "Protocol": "",
              "TLSSkipVerify": false,
              "TaskName": "",
              "Timeout": 3000000000,
              "Type": "tcp"
            },
            {
              "AddressMode": "",
              "Args": null,
              "CheckRestart": {
                "Grace": 60000000000,
                "IgnoreWarnings": false,
                "Limit": 3
              },
              "Command": "",
              "GRPCService": "",
              "GRPCUseTLS": false,
              "Header": {
                "Host": [
                  "<REDACTED>"
                ]
              },
              "InitialStatus": "warning",
              "Interval": 60000000000,
              "Method": "GET",
              "Name": "available",
              "Path": "/healthcheck",
              "PortLabel": "<REDACTED>-admin",
              "Protocol": "",
              "TLSSkipVerify": false,
              "TaskName": "",
              "Timeout": 45000000000,
              "Type": "http"
            }
          ],
          "Connect": null,
          "Meta": {},
          "Name": "<REDACTED>",
          "PortLabel": "<REDACTED>-proxy",
          "Tags": [<REDACTED>]
        },
        {
          "AddressMode": "auto",
          "CanaryMeta": null,
          "CanaryTags": null,
          "Checks": [
            {
              "AddressMode": "",
              "Args": null,
              "CheckRestart": null,
              "Command": "",
              "GRPCService": "",
              "GRPCUseTLS": false,
              "Header": null,
              "InitialStatus": "warning",
              "Interval": 20000000000,
              "Method": "GET",
              "Name": "admin-healthcheck",
              "Path": "/healthcheck",
              "PortLabel": "<REDACTED>-admin",
              "Protocol": "",
              "TLSSkipVerify": false,
              "TaskName": "",
              "Timeout": 3000000000,
              "Type": "http"
            }
          ],
          "Connect": null,
          "Meta": {},
          "Name": "<REDACTED>-admin",
          "PortLabel": "<REDACTED>-admin",
          "Tags": [<REDACTED>]
        }
      ],
      "ShutdownDelay": null,
      "Spreads": [
        {
          "Attribute": "${attr.unique.network.ip-address}",
          "SpreadTarget": null,
          "Weight": 50
        }
      ],
      "Tasks": [
        {
          "Affinities": null,
          "Artifacts": null,
          "Config": {
            "ulimit": [
              {
                "nofile": "100000:100000"
              }
            ],
            "force_pull": true,
            "image": "registry.<REDACTED>.net/<REDACTED>:797a8ae77057f707dfe50a1d5509d1a38cda4b89",
            "logging": [
              {
                "config": [
                  {
                    "syslog-format": "rfc5424micro",
                    "tag": "<REDACTED>_<REDACTED>_registry.<REDACTED>.net/<REDACTED>:<REDACTED>${NOMAD_ALLOC_ID}_{{.ID}}",
                    "syslog-address": "udp://${attr.unique.network.ip-address}:514"
                  }
                ],
                "driver": "syslog"
              }
            ]
          },
          "Constraints": null,
          "DispatchPayload": null,
          "Driver": "docker",
          "Env": {
            <REDACTED>
          },
          "KillSignal": "",
          "KillTimeout": 1000000000,
          "Kind": "",
          "Leader": false,
          "LogConfig": {
            "MaxFileSizeMB": 10,
            "MaxFiles": 10
          },
          "Meta": {
            "Purpose": "service",
            "StackServiceName": "<REDACTED>"
          },
          "Name": "<REDACTED>-<REDACTED>",
          "Resources": {
            "CPU": 1000,
            "Devices": null,
            "DiskMB": 0,
            "IOPS": 0,
            "MemoryMB": 2000,
            "Networks": null
          },
          "Services": null,
          "ShutdownDelay": 10000000000,
          "Templates": [
            {
              "ChangeMode": "restart",
              "ChangeSignal": "",
              "DestPath": "secrets/rendered.env",
              "EmbeddedTmpl": "<REDACTED>",
              "Envvars": true,
              "LeftDelim": "{{",
              "Perms": "0644",
              "RightDelim": "}}",
              "SourcePath": "",
              "Splay": 5000000000,
              "VaultGrace": 15000000000
            }
          ],
          "User": "",
          "Vault": {
            "ChangeMode": "restart",
            "ChangeSignal": "SIGHUP",
            "Env": true,
            "Policies": [
              "<REDACTED>-<REDACTED>-<REDACTED>"
            ]
          },
          "VolumeMounts": null
        },
        {
          "Affinities": null,
          "Artifacts": null,
          "Config": {
            "ulimit": [
              {
                "nofile": "100000:100000"
              }
            ],
            "args": [
              "-l",
              "warn"
            ],
            "image": "registry.<REDACTED>/envoy-sidecar:v0.10.0"
          },
          "Constraints": null,
          "DispatchPayload": null,
          "Driver": "docker",
          "Env": {
            "SERVICE_PORT": "10020",
            "XDS_ADDRESS": "${attr.unique.network.ip-address}",
            "LISTENER_PROTOCOL": "http",
            "NAMESPACE": "<REDACTED>",
            "NODE_ID": "<REDACTED>-${NOMAD_ALLOC_ID}-proxy",
            "SERVICE_NAME": "<REDACTED>"
          },
          "KillSignal": "",
          "KillTimeout": 30000000000,
          "Kind": "",
          "Leader": true,
          "LogConfig": {
            "MaxFileSizeMB": 10,
            "MaxFiles": 10
          },
          "Meta": {
            "purpose": "proxy"
          },
          "Name": "<REDACTED>-<REDACTED>-proxy",
          "Resources": {
            "CPU": 250,
            "Devices": null,
            "DiskMB": 0,
            "IOPS": 0,
            "MemoryMB": 250,
            "Networks": null
          },
          "Services": null,
          "ShutdownDelay": 60000000000,
          "Templates": null,
          "User": "",
          "Vault": null,
          "VolumeMounts": null
        }
      ],
      "Update": {
        "AutoPromote": false,
        "AutoRevert": true,
        "Canary": 0,
        "HealthCheck": "checks",
        "HealthyDeadline": 200000000000,
        "MaxParallel": 1,
        "MinHealthyTime": 10000000000,
        "ProgressDeadline": 600000000000,
        "Stagger": 30000000000
      },
      "Volumes": null
    }
]
@djenriquez
Author

@drewbailey this is related to the work in #6746. I don't think I'm missing anything in my job config, but it does not seem to be working as expected, and I'm not sure why.

@drewbailey
Contributor

drewbailey commented Mar 4, 2020

Hey @djenriquez, thanks for reporting. There seem to be a few things going on, so I wanted to share a reproduction job file to discuss the different scenarios.

repro.hcl
job "delay" {
  datacenters = ["dc1"]

  group "api" {
    # shutdown_delay = "30s"

    meta {
      foo = "bar"
    }

    service {
      tags = ["group"]
    }

    spread {
      attribute = "${node.datacenter}"

      target "dc1" {
        percent = 100
      }
    }

    task "web" {
      driver = "docker"

      # shutdown_delay = "10s"

      config {
        image = "hashicorpnomad/counter-api:v1"
      }
      service {
        tags = ["awesome"]
      }
    }

    task "dashboard" {
      driver = "docker"

      shutdown_delay = "60s"

      env {
        COUNTING_SERVICE_URL = "http://${NOMAD_UPSTREAM_ADDR_count_api}"
      }

      config {
        image = "hashicorpnomad/counter-dashboard:v1"
      }

      service {
        tags = ["dashboard-service"]
      }
    }
  }
}
  1. #7265 (Changes to Task Group shutdown_delay not reflected in job plan output) has been created to fix that problem.
  2. Currently for tasks, shutdown_delay applies to a service, so it will not be registered if a service doesn't exist. The serviceHook handles waiting for the delay in its preKilling step (see the sketch after this list).
  3. I'm investigating the deployments aspect further and will come back with more info.
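
To illustrate point 2, here is a minimal job-spec sketch (names borrowed from the repro above) of the only shape under which a task-level delay currently takes effect: the task itself registers a service. Without the task-level service block, the serviceHook never runs and the delay is skipped.

task "web" {
  driver = "docker"

  # Only honored today because a task-level service is registered below.
  shutdown_delay = "10s"

  config {
    image = "hashicorpnomad/counter-api:v1"
  }

  service {
    tags = ["awesome"]
  }
}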

edit:
I haven't been able to reproduce the deployment issue; it seems to be waiting on the task's shutdown_delay, and the new alloc will wait in pending until the time has elapsed. Could you confirm that the shutdown_delays work as expected as long as a service is also being registered?

edit 2:
Regarding bullet 2 and shutdown_delay being tied to service registration: we will be treating this as a bug and will allow shutdown_delay to run regardless of service registration, since the two are not explicitly tied together in the job spec.

@djenriquez
Author

djenriquez commented Mar 4, 2020

Hi @drewbailey, yes, sorry, there might have been some confusion in my reporting. The shutdown_delay does go into effect if the allocation's definition defines one at the task group level. I was expecting the task's shutdown_delay of 60s to go into effect, which is why I reported that during the deployment the shutdown signal was sent after only 10s and not 60s (60s was the task shutdown delay, 10s was the task group shutdown delay).

However, you clarified that shutdown_delay only goes into effect IF the task has a service. Since we register services at the taskgroup level and not the task level (because of network namespacing), the task's shutdown_delay will never go into effect. Is this a true statement?

Does your edit no. 2 allow for tasks' shutdown_delay to go into effect regardless of service registration?

@djenriquez
Author

djenriquez commented Mar 4, 2020

Interestingly, I'm looking at your repro.hcl and I see that both your task and task group have service stanzas defined. I didn't realize you can define services at both the task group and task level. How would you define checks for tasks in this case? Checks require a PortLabel, which is not available in the context of a task if a network namespace is defined, right?

@danlsgiga
Contributor

> 2. Currently for tasks, `shutdown_delay` applies to a service, so it will [not be registered if a service doesn't exist](https://github.com/hashicorp/nomad/blob/3284a34b4289d449970fa510d66c14135277eb66/client/allocrunner/taskrunner/task_runner_hooks.go#L99). The serviceHook handles waiting for the delay in its [preKilling step](https://github.com/hashicorp/nomad/blob/3284a34b4289d449970fa510d66c14135277eb66/client/allocrunner/taskrunner/service_hook.go#L134).

It would be great to have shutdown_delay honoured for any type of task. My use case is batch jobs that run really fast, alongside a filebeat sidecar that ships the logs to logstash. I have shutdown_delay set on the filebeat task to give it some time to read and push the logs, but that does not happen and filebeat is killed immediately after the lead task finishes.
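
For context, a minimal sketch of the kind of job described above (the image names, versions, and delay value are assumptions):

job "fast-batch" {
  datacenters = ["dc1"]
  type        = "batch"

  group "work" {
    task "main" {
      driver = "docker"
      leader = true # when this lead task finishes, the sidecar is shut down

      config {
        image = "example/fast-batch:latest"
      }
    }

    task "filebeat" {
      driver = "docker"

      # Intended to give filebeat time to ship remaining logs to logstash,
      # but currently ignored because this task registers no service.
      shutdown_delay = "30s"

      config {
        image = "docker.elastic.co/beats/filebeat:7.6.0"
      }
    }
  }
}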

@djenriquez
Author

djenriquez commented Mar 4, 2020

@danlsgiga a really, really bad hack for right now would be to set the kill signal to something that won't terminate the process, then use the kill_timeout as your actual shutdown. Yeah, I know, but it'll work.

Edit: Nevermind, you said batch job. So that'll run to completion by itself.
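
For reference, a rough sketch of that hack (the signal choice is an assumption; it only works if the process actually ignores that signal):

task "proxy" {
  driver = "docker"

  # A signal the process ignores, so the initial "kill" is a no-op...
  kill_signal = "SIGUSR1"

  # ...and the real shutdown happens when kill_timeout expires and Nomad
  # escalates to SIGKILL.
  kill_timeout = "60s"

  config {
    image = "example/proxy:latest"
  }
}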

@danlsgiga
Contributor

yeah, didn't think about that option. Thanks for that... but I prefer to wait for the fix because... you know... if I set that hack, it will be there forever 😄

@ryan-shaw

Also having the same issue. Does shutdown_delay only apply to service-type jobs? I have a system job where the shutdown_delay is not being applied.

@stale

stale bot commented Jul 5, 2020

Hey there

Since this issue hasn't had any activity in a while - we're going to automatically close it in 30 days. If you're still seeing this issue with the latest version of Nomad, please respond here and we'll keep this open and take another look at this.

Thanks!

@tgross tgross added this to Needs Roadmapping in Nomad - Community Issues Triage Feb 12, 2021
@tgross tgross removed this from Needs Roadmapping in Nomad - Community Issues Triage Mar 4, 2021
@tgross tgross added this to Needs Triage in Nomad - Community Issues Triage via automation Mar 4, 2021
@tgross tgross moved this from Needs Triage to Needs Roadmapping in Nomad - Community Issues Triage Mar 4, 2021
@drewbailey drewbailey removed their assignment May 17, 2021
@sashayakovtseva

sashayakovtseva commented Nov 9, 2021

Having the same issue.
Nomad version: v1.1.2
Job type is service. Deploying with canaries (the canary count matching the group count), shutdown_delay at the task level is set to 2 minutes, but right after canary promotion I see allocs in a completed state.

UPD: shutdown_delay at the group level works, though.

@sashayakovtseva

UPD: shutdown_delay at the group level does not work as expected. After canary promotion, instead of:

  1. deregistering services from Consul
  2. waiting shutdown_delay
  3. sending the shutdown signal

I get:

  1. waiting shutdown_delay
  2. deregistering services from Consul
  3. sending the shutdown signal

That is a huge disappointment, since shutdown_delay is vital for us to let external LBs update their configs via consul-template.

@sashayakovtseva

@tgross Sorry to bother you, but is anything happening with this issue? I am willing to help with any investigation if needed.

@tgross
Member

tgross commented Feb 10, 2022

@sashayakovtseva this issue probably should have been closed when #7663 was merged for 0.11.1.

What you're describing isn't what this ticket is about (shutdown_delay not being respected at all unless there were also service blocks), so that's why your request got a bit lost. Can you open a new issue describing your problem, along with a reproduction if possible?

@tgross tgross closed this as completed Feb 10, 2022
Nomad - Community Issues Triage automation moved this from Needs Roadmapping to Done Feb 10, 2022
@sashayakovtseva

I've double-checked this issue again. Task-level shutdown_delay works; group-level shutdown_delay does not. But this is not a problem for me anymore; I got the behaviour I needed. Thanks, and sorry again for my mistake.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 11, 2022