
Clean up consul earlier when destroying a task #2596

Merged · 2 commits into hashicorp:master · May 2, 2017

Conversation

weargoggles
Contributor

It seems as if consumers of a service would prefer to know as early as possible that it is going away. Currently the run loop waits until after the task has exited to even schedule the removal from Consul. This change schedules the Consul client operations before the task is killed, with the intention of narrowing the window in which the task has finished but remains registered with Consul.
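For orientation, here is a minimal, self-contained sketch of the ordering this PR establishes. The type and method names follow the PR's diff, but the bodies and the `killTask` helper are illustrative stand-ins, not Nomad's actual implementation:

```go
package main

import "fmt"

// Stand-ins for Nomad's real types; the names follow the PR's diff, but
// the bodies are illustrative only.
type consulClient struct{}

// RemoveTask deregisters a task's services. In Nomad the real call is
// idempotent, which this review relies on below.
func (c *consulClient) RemoveTask(allocID, task string) {
	fmt.Printf("deregistered %s (alloc %s) from Consul\n", task, allocID)
}

type TaskRunner struct {
	consul  *consulClient
	allocID string
	task    string
}

// consulCleanup removes the task from Consul.
func (r *TaskRunner) consulCleanup() {
	r.consul.RemoveTask(r.allocID, r.task)
}

func (r *TaskRunner) killTask() { fmt.Println("task killed") }

func main() {
	r := &TaskRunner{consul: &consulClient{}, allocID: "a1", task: "web"}
	// The PR's core change: deregister before killing, narrowing the
	// window where a finished task is still registered with Consul.
	r.consulCleanup()
	r.killTask()
}
```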

```diff
-// cleanup removes Consul entries and calls Driver.Cleanup when a task is
-// stopping. Errors are logged.
-func (r *TaskRunner) cleanup() {
+func (r *TaskRunner) consulCleanup() {
 	// Remove from Consul
```
Member


Just turn this comment into a comment on the func like below and this is 👍

```go
// consulCleanup removes the task from Consul
func ...
```

```diff
-// cleanup removes Consul entries and calls Driver.Cleanup when a task is
-// stopping. Errors are logged.
-func (r *TaskRunner) cleanup() {
+func (r *TaskRunner) consulCleanup() {
```
Contributor


Can you remove this method, since it just wraps one line?


```diff
+// cleanup calls Driver.Cleanup when a task is
+// stopping. Errors are logged.
+func (r *TaskRunner) cleanup() {
```
Contributor


The RemoveTask call is idempotent so keep it here for the common cases.


```diff
+	// Remove from consul before killing the task so that traffic
+	// can be rerouted
+	r.consulCleanup()
```
Contributor


Can you move this to below the `!running` check and then just call `r.consul.RemoveTask(r.alloc.ID, r.task)`?
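A hedged sketch of the shape this reviewer is asking for; the `running` flag, the `destroy` wrapper, and the stub types are assumptions reconstructed from the comment, not the actual file:

```go
package main

import "fmt"

type consulClient struct{}

func (c *consulClient) RemoveTask(allocID, task string) {
	fmt.Printf("deregistered %s (alloc %s)\n", task, allocID)
}

type TaskRunner struct {
	consul  *consulClient
	allocID string
	task    string
}

// destroy sketches the suggested shape: bail out early if the task never
// ran, and otherwise call RemoveTask directly instead of going through a
// one-line wrapper method.
func (r *TaskRunner) destroy(running bool) {
	if !running {
		return // nothing was registered, so nothing to remove
	}
	r.consul.RemoveTask(r.allocID, r.task)
	fmt.Println("proceeding to kill the task")
}

func main() {
	r := &TaskRunner{consul: &consulClient{}, allocID: "a1", task: "web"}
	r.destroy(true)
}
```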

```diff
@@ -918,6 +918,7 @@ func (r *TaskRunner) run() {
 		select {
 		case success := <-prestartResultCh:
 			if !success {
+				r.consulCleanup()
```
Contributor


You can remove these and just call cleanup in all but the destroy case.

@jemc

jemc commented Apr 28, 2017

Thanks for opening this PR based on our discussion in Gitter! As mentioned in the discussion there, I'm finding that this change is necessary for doing things like zero-downtime rolling updates for pools of web servers, where the load balancer (like fabio or traefik) needs to see the consul service status change and take the backend server out of the pool before it actually starts shutting down.

I was planning to open a similar PR later after we finished testing my fork, but I'm glad to see that this PR is well-received.

One additional feature I've been testing with is adding a short sleep after deregistering the consul service, so that the change in status has some lead-time to propagate to the routing tables of the downstream load balancer.

I may want to open a PR with that change on top of this one in the near future, though I see no reason to hold up this straightforward PR, since we'd likely want the delay to be configurable, which implies touching many more files than this one touches.
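For illustration, a minimal sketch of such a configurable lead-time delay; the `stopConfig` type and `DeregisterDelay` field are hypothetical names, not real Nomad options:

```go
package main

import (
	"fmt"
	"time"
)

// stopConfig carries a hypothetical, user-configurable pause between
// Consul deregistration and task shutdown. Disabled when zero, so the
// default behaviour is unchanged.
type stopConfig struct {
	DeregisterDelay time.Duration
}

// stopTask deregisters first, waits out the configured lead time so
// downstream load balancers can update their routing tables, then kills.
func stopTask(cfg stopConfig, deregister, kill func()) {
	deregister()
	if cfg.DeregisterDelay > 0 {
		time.Sleep(cfg.DeregisterDelay)
	}
	kill()
}

func main() {
	cfg := stopConfig{DeregisterDelay: 3 * time.Second}
	stopTask(cfg,
		func() { fmt.Println("deregistered from Consul") },
		func() { fmt.Println("killing task") })
}
```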

@jemc

jemc commented Apr 28, 2017

A quick question about this PR: is it possible to add a test for this change in behaviour?

Because it is important to the proper functioning of our nomad-based deployment solution, I want to make sure it doesn't accidentally regress if someone in the future thinks they are cleaning up the codebase by removing a "duplicate" call.

@schmichael
Member

schmichael commented Apr 28, 2017

> One additional feature I've been testing with is adding a short sleep after deregistering the consul service, so that the change in status has some lead-time to propagate to the routing tables of the downstream load balancer.

I'm not sure we want to add pauses in Nomad itself since everyone's needs differ. You could add a sleep to your application's signal handler so it continues accepting requests for a certain period of time.
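A sketch of that workaround, assuming the application owns its own signal handling and can afford a fixed drain period:

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGINT, syscall.SIGTERM)

	<-sigCh
	// Keep serving during a grace period so requests still routed to us
	// complete while load balancers catch up with the Consul change.
	fmt.Println("signal received; draining for 5s before shutting down")
	time.Sleep(5 * time.Second)
	fmt.Println("starting shutdown")
}
```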

@jemc

jemc commented Apr 28, 2017

> everyone's needs differ.

Right, that's why my PR would make the delay configurable, and disabled by default.

> You could add a sleep to your application's signal handler so it continues accepting requests for a certain period of time.

Users may not always have this level of control over the application. For example, one might be using Nomad to run an off-the-shelf application; if that application doesn't include a configuration option for this (somewhat unusual) need, then the user would have to fork and run a patched version of it.

Even when the user is writing their own application, they may not have direct control over the signal handler, as for web servers this is usually implemented in the web server library (or web framework) that the application uses. Changing this behaviour may or may not be possible without maintaining a forked/patched version of the relevant library.

Maybe I'm wrong, but I don't think "wait N seconds after SIGINT before actually starting termination of the application" is likely to be a configuration option in very many off-the-shelf applications and libraries. It does feel to me like this belongs at the level of the scheduler that is already carefully controlling all aspects of the rolling deployment. As long as it is configurable and disabled by default, it should be an unobtrusive feature of Nomad, and immensely useful to those like me who need it.

@weargoggles
Contributor Author

Hi @schmichael, I didn't do what you wanted because @dadgar's change seemed to remove the need for it. Are you happy with this as it stands?

@jemc

jemc commented May 2, 2017

@weargoggles - is it feasible to add a unit test for this before merging?

@weargoggles
Contributor Author

@jemc Sorry, my Go skills don't extend to assertions about event ordering.

schmichael merged commit ba73ed5 into hashicorp:master on May 2, 2017
@schmichael
Member

Merged! Thanks @weargoggles! I can take it from here.

@jemc Mind filing an issue with your use case? I think it's an interesting idea, but I want to make sure we come up with the best possible solution.

@jemc

jemc commented May 2, 2017

@schmichael - yep, I'll file a new issue ticket for the discussion.

@alonalmog82

I am confused by the plethora of comments.
Does this change mean that in the case of a node drain, the service in Consul will be deregistered before the allocation is sent the kill signal?

@schmichael
Member

schmichael commented Apr 2, 2018

> Does this change mean that in the case of a node drain, the service in Consul will be deregistered before the allocation is sent the kill signal?

@alonalmog82

Yes, in 0.7.1 and later the logic when stopping a task is: (1) Remove from Consul, (2) Sleep if there's a shutdown_delay configured, and (3) Stop task

See here for the relevant code in 0.7.1: https://github.com/hashicorp/nomad/blob/v0.7.1/client/task_runner.go#L1212-L1222
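The same three-step ordering as a runnable sketch; `shutdown_delay` is the real jobspec option, everything else here is simplified for illustration:

```go
package main

import (
	"fmt"
	"time"
)

// stop mirrors the 0.7.1 ordering: deregister, honor any configured
// shutdown_delay, then stop the task. Simplified for illustration.
func stop(shutdownDelay time.Duration) {
	fmt.Println("1. remove services and checks from Consul")
	if shutdownDelay > 0 {
		fmt.Printf("2. sleep %s (shutdown_delay)\n", shutdownDelay)
		time.Sleep(shutdownDelay)
	}
	fmt.Println("3. stop the task")
}

func main() { stop(2 * time.Second) }
```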

@alonalmog82

alonalmog82 commented Apr 4, 2018 via email

@schmichael
Member

> Is the deregister from consul operation sent asynchronously?

Good question: it is asynchronous. All Consul operations happen in an asynchronous loop to handle Consul outages without blocking other Nomad operations.
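A sketch of that asynchronous pattern; this is not Nomad's actual consul package (which batches operations and retries across Consul outages), just the general shape of a non-blocking sync loop:

```go
package main

import (
	"fmt"
	"time"
)

// op is a pending Consul registration or deregistration.
type op struct{ desc string }

// syncLoop drains queued operations in the background, so callers never
// block on Consul availability. A real implementation would also retry
// failed operations until Consul is reachable again.
func syncLoop(ops <-chan op, done <-chan struct{}) {
	for {
		select {
		case o := <-ops:
			fmt.Println("applying:", o.desc)
		case <-done:
			return
		}
	}
}

func main() {
	ops := make(chan op, 16)
	done := make(chan struct{})
	go syncLoop(ops, done)

	ops <- op{desc: "deregister service web"} // enqueue returns immediately
	time.Sleep(100 * time.Millisecond)        // let the loop run for the demo
	close(done)
}
```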

> Does it make sense to verify the record has been deregistered before proceeding with the kill request?

If you don't find the shutdown_delay sufficient, I'm afraid so. Feel free to open an issue if you think Nomad should block until the service is removed from Consul. Perhaps Nomad could wait up to 5s for the deregistration to succeed before giving up and proceeding with the kill. I'm not sure we want to add yet another tuneable for this, but a new issue would be a good place to discuss it.
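If someone did want Nomad to block, the bounded wait floated here might look roughly like this; `waitDeregistered` and its polling interval are hypothetical, and the 5s budget comes straight from the comment above:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync/atomic"
	"time"
)

// waitDeregistered is a hypothetical helper: poll until the check reports
// the service is gone from Consul, or give up when the context expires.
func waitDeregistered(ctx context.Context, gone func() bool) error {
	tick := time.NewTicker(100 * time.Millisecond)
	defer tick.Stop()
	for {
		if gone() {
			return nil
		}
		select {
		case <-ctx.Done():
			return errors.New("timed out waiting for deregistration")
		case <-tick.C:
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	var gone atomic.Bool
	go func() { time.Sleep(300 * time.Millisecond); gone.Store(true) }()

	if err := waitDeregistered(ctx, gone.Load); err != nil {
		fmt.Println(err) // proceed with the kill anyway
	}
	fmt.Println("proceeding to kill the task")
}
```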
