
Clean up consul earlier when destroying a task #2596

Merged · 2 commits into hashicorp:master · May 2, 2017

Conversation

weargoggles
Contributor

It seems as if consumers of a service would prefer to know as early as possible that it is going away. Currently the run loop waits until after the task has exited to even schedule the removal from Consul. This change schedules the Consul client operations before the task is killed, with the intention of narrowing the window in which the task has finished but remains registered with Consul.
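For orientation, here is a minimal, self-contained sketch of the ordering this PR establishes. The type and method names follow the PR's diff, but the bodies and the `killTask` helper are illustrative stand-ins, not Nomad's actual implementation:

```go
package main

import "fmt"

// Stand-ins for Nomad's real types; the names follow the PR's diff, but
// the bodies are illustrative only.
type consulClient struct{}

// RemoveTask deregisters a task's services. In Nomad the real call is
// idempotent, which this review relies on below.
func (c *consulClient) RemoveTask(allocID, task string) {
	fmt.Printf("deregistered %s (alloc %s) from Consul\n", task, allocID)
}

type TaskRunner struct {
	consul  *consulClient
	allocID string
	task    string
}

// consulCleanup removes the task from Consul.
func (r *TaskRunner) consulCleanup() {
	r.consul.RemoveTask(r.allocID, r.task)
}

func (r *TaskRunner) killTask() { fmt.Println("task killed") }

func main() {
	r := &TaskRunner{consul: &consulClient{}, allocID: "a1", task: "web"}
	// The PR's core change: deregister before killing, narrowing the
	// window where a finished task is still registered with Consul.
	r.consulCleanup()
	r.killTask()
}
```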

```diff
-// cleanup removes Consul entries and calls Driver.Cleanup when a task is
-// stopping. Errors are logged.
-func (r *TaskRunner) cleanup() {
+func (r *TaskRunner) consulCleanup() {
 	// Remove from Consul
```
Member


Just turn this comment into a comment on the func like below and this is 👍

```go
// consulCleanup removes the task from Consul
func ...
```

```diff
-// cleanup removes Consul entries and calls Driver.Cleanup when a task is
-// stopping. Errors are logged.
-func (r *TaskRunner) cleanup() {
+func (r *TaskRunner) consulCleanup() {
```
Contributor


Can you remove this method, since it just wraps one line?


```diff
+// cleanup calls Driver.Cleanup when a task is
+// stopping. Errors are logged.
+func (r *TaskRunner) cleanup() {
```
Contributor


The RemoveTask call is idempotent so keep it here for the common cases.


```diff
+	// Remove from consul before killing the task so that traffic
+	// can be rerouted
+	r.consulCleanup()
```
Contributor


Can you move this to below the `!running` check and then just call `r.consul.RemoveTask(r.alloc.ID, r.task)`?
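A hedged sketch of the shape this reviewer is asking for; the `running` flag, the `destroy` wrapper, and the stub types are assumptions reconstructed from the comment, not the actual file:

```go
package main

import "fmt"

type consulClient struct{}

func (c *consulClient) RemoveTask(allocID, task string) {
	fmt.Printf("deregistered %s (alloc %s)\n", task, allocID)
}

type TaskRunner struct {
	consul  *consulClient
	allocID string
	task    string
}

// destroy sketches the suggested shape: bail out early if the task never
// ran, and otherwise call RemoveTask directly instead of going through a
// one-line wrapper method.
func (r *TaskRunner) destroy(running bool) {
	if !running {
		return // nothing was registered, so nothing to remove
	}
	r.consul.RemoveTask(r.allocID, r.task)
	fmt.Println("proceeding to kill the task")
}

func main() {
	r := &TaskRunner{consul: &consulClient{}, allocID: "a1", task: "web"}
	r.destroy(true)
}
```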

```diff
@@ -918,6 +918,7 @@ func (r *TaskRunner) run() {
 		select {
 		case success := <-prestartResultCh:
 			if !success {
+				r.consulCleanup()
```
Contributor


You can remove these and just call cleanup in all but the destroy case.

@jemc

jemc commented Apr 28, 2017

Thanks for opening this PR based on our discussion in Gitter! As mentioned in the discussion there, I'm finding that this change is necessary for doing things like zero-downtime rolling updates for pools of web servers, where the load balancer (like fabio or traefik) needs to see the consul service status change and take the backend server out of the pool before it actually starts shutting down.

I was planning to open a similar PR later after we finished testing my fork, but I'm glad to see that this PR is well-received.

One additional feature I've been testing with is adding a short sleep after deregistering the consul service, so that the change in status has some lead-time to propagate to the routing tables of the downstream load balancer.

I may want to open a PR with that change on top of this one in the near future, though I see no reason to hold up this straightforward PR, since we'd likely want the delay to be configurable, which implies touching many more files than this one touches.
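For illustration, a minimal sketch of such a configurable lead-time delay; the `stopConfig` type and `DeregisterDelay` field are hypothetical names, not real Nomad options:

```go
package main

import (
	"fmt"
	"time"
)

// stopConfig carries a hypothetical, user-configurable pause between
// Consul deregistration and task shutdown. Disabled when zero, so the
// default behaviour is unchanged.
type stopConfig struct {
	DeregisterDelay time.Duration
}

// stopTask deregisters first, waits out the configured lead time so
// downstream load balancers can update their routing tables, then kills.
func stopTask(cfg stopConfig, deregister, kill func()) {
	deregister()
	if cfg.DeregisterDelay > 0 {
		time.Sleep(cfg.DeregisterDelay)
	}
	kill()
}

func main() {
	cfg := stopConfig{DeregisterDelay: 3 * time.Second}
	stopTask(cfg,
		func() { fmt.Println("deregistered from Consul") },
		func() { fmt.Println("killing task") })
}
```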

@jemc

jemc commented Apr 28, 2017

A quick question about this PR: is it possible to add a test for this change in behaviour?

Because it is important to the proper functioning of our nomad-based deployment solution, I want to make sure it doesn't accidentally regress if someone in the future thinks they are cleaning up the codebase by removing a "duplicate" call.

@schmichael
Member

schmichael commented Apr 28, 2017

> One additional feature I've been testing with is adding a short sleep after deregistering the consul service, so that the change in status has some lead-time to propagate to the routing tables of the downstream load balancer.

I'm not sure we want to add pauses in Nomad itself since everyone's needs differ. You could add a sleep to your application's signal handler so it continues accepting requests for a certain period of time.
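A sketch of that workaround, assuming the application owns its own signal handling and can afford a fixed drain period:

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGINT, syscall.SIGTERM)

	<-sigCh
	// Keep serving during a grace period so requests still routed to us
	// complete while load balancers catch up with the Consul change.
	fmt.Println("signal received; draining for 5s before shutting down")
	time.Sleep(5 * time.Second)
	fmt.Println("starting shutdown")
}
```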

@jemc

jemc commented Apr 28, 2017

> everyone's needs differ.

Right, that's why my PR would make the delay configurable, and disabled by default.

> You could add a sleep to your application's signal handler so it continues accepting requests for a certain period of time.

Users may not always have this level of control over the application. For example, one might be using Nomad to run an off-the-shelf application; if that application doesn't include a configuration option for this (somewhat unusual) need, then the user would have to fork and run a patched version of it.

Even when the user is writing their own application, they may not have direct control over the signal handler, as for web servers this is usually implemented in the web server library (or web framework) that the application uses. Changing this behaviour may or may not be possible without maintaining a forked/patched version of the relevant library.

Maybe I'm wrong, but I don't think "wait N seconds after SIGINT before actually starting termination of the application" is likely to be a configuration option in very many off-the-shelf applications and libraries. It does feel to me like this belongs at the level of the scheduler that is already carefully controlling all aspects of the rolling deployment. As long as it is configurable and disabled by default, it should be an unobtrusive feature of Nomad, and immensely useful to those like me who need it.

@weargoggles
Contributor Author

Hi @schmichael, I didn't do what you wanted because @dadgar's change seemed to remove the need for it. Are you happy with this as it stands?

@jemc

jemc commented May 2, 2017

@weargoggles - is it feasible to add a unit test for this before merging?

@weargoggles
Contributor Author

@jemc Sorry, my Go skills don't extend to assertions about event ordering.

schmichael merged commit ba73ed5 into hashicorp:master on May 2, 2017
@schmichael
Member

Merged! Thanks @weargoggles! I can take it from here.

@jemc Mind filing an issue with your use case? I think it's an interesting idea, but I want to make sure we come up with the best possible solution.

@jemc

jemc commented May 2, 2017

@schmichael - yep, I'll file a new issue ticket for the discussion.

@alonalmog82

I am confused by the plethora of comments.
Does this change mean that in the case of a node drain, the service in Consul will be deregistered before the allocation is sent the kill signal?

@schmichael
Member

schmichael commented Apr 2, 2018

> Does this change mean that in the case of a node drain, the service in Consul will be deregistered before the allocation is sent the kill signal?

@alonalmog82

Yes, in 0.7.1 and later the logic when stopping a task is: (1) Remove from Consul, (2) Sleep if there's a shutdown_delay configured, and (3) Stop task

See here for the relevant code in 0.7.1: https://github.com/hashicorp/nomad/blob/v0.7.1/client/task_runner.go#L1212-L1222
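The same three-step ordering as a runnable sketch; `shutdown_delay` is the real jobspec option, everything else here is simplified for illustration:

```go
package main

import (
	"fmt"
	"time"
)

// stop mirrors the 0.7.1 ordering: deregister, honor any configured
// shutdown_delay, then stop the task. Simplified for illustration.
func stop(shutdownDelay time.Duration) {
	fmt.Println("1. remove services and checks from Consul")
	if shutdownDelay > 0 {
		fmt.Printf("2. sleep %s (shutdown_delay)\n", shutdownDelay)
		time.Sleep(shutdownDelay)
	}
	fmt.Println("3. stop the task")
}

func main() { stop(2 * time.Second) }
```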

@alonalmog82

alonalmog82 commented Apr 4, 2018 via email

@schmichael
Member

> Is the deregister from consul operation sent asynchronously?

Good question: it is asynchronous. All Consul operations happen in an asynchronous loop to handle Consul outages without blocking other Nomad operations.
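A sketch of that asynchronous pattern; this is not Nomad's actual consul package (which batches operations and retries across Consul outages), just the general shape of a non-blocking sync loop:

```go
package main

import (
	"fmt"
	"time"
)

// op is a pending Consul registration or deregistration.
type op struct{ desc string }

// syncLoop drains queued operations in the background, so callers never
// block on Consul availability. A real implementation would also retry
// failed operations until Consul is reachable again.
func syncLoop(ops <-chan op, done <-chan struct{}) {
	for {
		select {
		case o := <-ops:
			fmt.Println("applying:", o.desc)
		case <-done:
			return
		}
	}
}

func main() {
	ops := make(chan op, 16)
	done := make(chan struct{})
	go syncLoop(ops, done)

	ops <- op{desc: "deregister service web"} // enqueue returns immediately
	time.Sleep(100 * time.Millisecond)        // let the loop run for the demo
	close(done)
}
```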

> Does it make sense to verify the record has been deregistered before proceeding with the kill request?

If you don't find the shutdown_delay sufficient, I'm afraid so. Feel free to open an issue if you think Nomad should block until the service is removed from Consul. Perhaps Nomad could wait up to 5s for the deregistration to succeed before giving up and proceeding with the kill. I'm not sure we want to add yet another tuneable for this, but a new issue would be a good place to discuss it.
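If someone did want Nomad to block, the bounded wait floated here might look roughly like this; `waitDeregistered` and its polling interval are hypothetical, and the 5s budget comes straight from the comment above:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync/atomic"
	"time"
)

// waitDeregistered is a hypothetical helper: poll until the check reports
// the service is gone from Consul, or give up when the context expires.
func waitDeregistered(ctx context.Context, gone func() bool) error {
	tick := time.NewTicker(100 * time.Millisecond)
	defer tick.Stop()
	for {
		if gone() {
			return nil
		}
		select {
		case <-ctx.Done():
			return errors.New("timed out waiting for deregistration")
		case <-tick.C:
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	var gone atomic.Bool
	go func() { time.Sleep(300 * time.Millisecond); gone.Store(true) }()

	if err := waitDeregistered(ctx, gone.Load); err != nil {
		fmt.Println(err) // proceed with the kill anyway
	}
	fmt.Println("proceeding to kill the task")
}
```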
