Configurable delay between deregistering service and killing task #2441

bsphere · 2017-03-13T19:37:51Z

I believe that for true rolling updates of jobs, the updated alloc's service endpoints should first be removed from Consul, followed by a grace period for active connections to drain, and only then should the task be restarted.

dadgar · 2017-03-13T19:53:35Z

@bsphere Nomad can't decide what that grace period is, as it varies per job. The correct way to handle this is for Nomad to send a signal that the application is being shut down. The application should then fail its health check, which makes Consul stop routing traffic to that instance while it drains connections/work, and then it should exit.

The service exists and thus is registered in Consul, the only thing changing is its status which is reflected by checks.
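
As a rough illustration of the application-side approach described above (fail the health check on the shutdown signal, drain, then exit), here is a minimal Python sketch using only the standard library. The `/health` path, port 8080, the 5-second drain window, and the names `draining`, `health_status`, and `serve` are all assumptions made for this example; nothing here is prescribed by Nomad or Consul, and which signal the task actually receives depends on the driver and job configuration.

```python
import signal
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Set once a shutdown signal arrives; the health check fails from then on.
draining = threading.Event()


def health_status() -> int:
    """HTTP status for the health endpoint: 200 normally, 503 while draining."""
    return 503 if draining.is_set() else 200


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only the (hypothetical) /health path reflects the draining state;
        # ordinary requests keep being served during the drain window.
        status = health_status() if self.path == "/health" else 200
        body = b"ok\n" if status == 200 else b"draining\n"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def serve(port: int = 8080, drain_seconds: float = 5.0) -> None:
    """Serve until SIGINT/SIGTERM, then fail health checks, drain, and exit."""
    # 127.0.0.1 is for local experimentation only (assumption).
    server = ThreadingHTTPServer(("127.0.0.1", port), Handler)
    # Signal handlers must be installed from the main thread.
    for sig in (signal.SIGINT, signal.SIGTERM):
        signal.signal(sig, lambda signum, frame: draining.set())
    threading.Thread(target=server.serve_forever, daemon=True).start()
    # Block until a shutdown signal flips the flag.
    while not draining.wait(timeout=0.5):
        pass
    # Keep serving while the failing check steers traffic away.
    time.sleep(drain_seconds)
    server.shutdown()
    server.server_close()
```

Calling `serve()` from the main thread starts the server. For this to work end to end, the drain window has to be longer than the health check interval and shorter than the task's `kill_timeout`, otherwise the task is killed before it finishes draining.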

bsphere · 2017-03-13T22:19:05Z

Seems like a possible solution, but it requires support from the task side. What about having the grace period in the job settings? This way "legacy" code is still supported.

jemc · 2017-05-04T17:27:09Z

I think this feature is really important, and that counting on the application to handle it is not always a feasible solution.

The first part of this solution (deregistering the consul service first, before initiating the kill sequence) was achieved in #2596.

I believe the next important step is to introduce a delay between the service deregistration and the kill, configurable as part of the Nomad job spec, with the intent of giving other services in a distributed system (like a load balancer) ample time to stop interacting with the service before it is killed.

Please see relevant discussion in #2607 and #2596. I think I've made some important arguments there that haven't been raised here in this ticket yet.

ygersie · 2017-07-27T20:48:14Z

Agree with everything @jemc posted. Ideally Nomad would put the related Consul service into maintenance mode with a configurable timeout (a default of 1 second would already be enough in most cases) before initiating deregistration and SIGTERM. This is especially troublesome right now in combination with github.com/eBay/fabio. It takes a few tens of milliseconds before Fabio removes the route, which leads to client-side 503s. This is fairly problematic and I don't see a really nice solution for it except for introducing extra logic in all of our services.

This seems like a fairly trivial thing for Nomad to provide, as opposed to the amount of development required to get every service to handle a SIGTERM by first failing the health check, waiting, and then shutting down.

ygersie · 2017-07-27T20:49:15Z

Also consider the fact that not every service we run with Nomad is under our control (Nginx would be one of them).

ygersie · 2017-07-27T20:54:35Z

I think the title of this issue should be renamed to Graceful shutdown or something, as this applies to all variations of stopping allocations (drain, stop job, deploy).

mlehner616 · 2017-07-27T23:37:31Z

@dropje86 Thank you for posting this; I actually have a half-written issue that I was about to post today for exactly the same thing. This also particularly affects Consul integration with regard to templates and the change_signal. The other use case is on deploy as well. It seems like Nomad should have all the information it needs to trigger a `consul maint` or deregister and THEN kill/signal the alloc. This is going to be a big problem, as we can't have client connections simply dropped as we run deploys or change Consul values. For deploys there is a fairly straightforward workaround of triggering a `consul maint` during the process, but I think the use case where we'd have to have Nomad do it is during that Consul KV update.


skyrocknroll · 2017-08-13T15:20:44Z

This is really important for us. Right now we ignore the soft-kill signal, so the Consul service gets deregistered and kill_timeout acts as the delay. After that the container is brutally killed. Providing a delay config would help us handle everything gracefully. @dadgar

schmichael · 2017-08-16T23:33:47Z

Proposal:

```hcl
job "docs" {
  group "example" {
    task "server" {
      # ...

      # Delay between deregister and kill signal
      shutdown_delay = "5s"
    }
  }
}
```

Where shutdown_delay is the duration between deregistering services from Consul and sending the task the shutdown signal.

Defaults to 0 (no delay) for backward compatibility and because this feature should be opt-in.
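
To make the resulting ordering concrete, the following Python sketch models the shutdown sequence as a simple timeline: deregistration, then `shutdown_delay`, then the shutdown signal, then `kill_timeout` before a forced kill. This is only an illustrative model under those assumptions, not Nomad's implementation; the function name `shutdown_timeline` and the event labels are made up for the example, and the 5-second `kill_timeout` default mirrors Nomad's documented default.

```python
from datetime import timedelta


def shutdown_timeline(
    shutdown_delay: timedelta = timedelta(0),
    kill_timeout: timedelta = timedelta(seconds=5),
):
    """Return (offset, event) pairs, with offsets measured from deregistration."""
    deregister_at = timedelta(0)
    # The task is only signalled once shutdown_delay has elapsed.
    signal_at = deregister_at + shutdown_delay
    # After the signal, the task has kill_timeout to exit on its own.
    force_kill_at = signal_at + kill_timeout
    return [
        (deregister_at, "deregister services from Consul"),
        (signal_at, "send shutdown signal to task"),
        (force_kill_at, "force-kill task if still running"),
    ]


if __name__ == "__main__":
    for offset, event in shutdown_timeline(shutdown_delay=timedelta(seconds=5)):
        print(f"t+{offset.total_seconds():.0f}s  {event}")
```

With `shutdown_delay = "5s"` and the default `kill_timeout`, load balancers get 5 seconds to drop the instance before the task sees any signal, and the task then has a further 5 seconds to exit. With the default of 0, deregistration and the signal happen back to back, which is the pre-existing behavior.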

skyrocknroll · 2017-08-17T10:10:52Z

@schmichael This is just insanely awesome. Thanks ❤️ 💯

schmichael · 2017-08-17T21:40:17Z

Thanks for the input everyone! 0.6.1 should be coming out soon with this feature.

mlehner616 · 2017-08-19T06:00:54Z

@schmichael thank you for the attention on this, this will help with draining services a ton!

ygersie · 2017-08-19T10:22:39Z

Thanks @schmichael, very helpful!

github-actions · 2022-12-10T02:15:13Z

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

dadgar added the stage/waiting-reply label Mar 13, 2017

dadgar added theme/discovery stage/thinking and removed stage/waiting-reply labels Mar 14, 2017

schmichael mentioned this issue May 4, 2017

Configurable delay after deregistering consul service, before killing task #2607

Closed

dadgar changed the title ~~Feature request: remove endpoint from Consul and wait for connection draining during "rolling updates"~~ Configurable delay between deregistering service and killing task Jul 28, 2017

schmichael added a commit that referenced this issue Aug 17, 2017

Add optional shutdown delay to tasks

beae45b

Fixes #2441 Defaults to 0 (no delay) for backward compat and because this feature should be opt-in.

schmichael mentioned this issue Aug 17, 2017

Add optional shutdown delay to tasks #3043

Merged

schmichael closed this as completed in #3043 Aug 17, 2017

github-actions bot locked as resolved and limited conversation to collaborators Dec 10, 2022
