Clean up consul earlier when destroying a task #2596
client/task_runner.go
Outdated
 // cleanup removes Consul entries and calls Driver.Cleanup when a task is
 // stopping. Errors are logged.
-func (r *TaskRunner) cleanup() {
+func (r *TaskRunner) consulCleanup() {
 	// Remove from Consul
Just turn this comment into a comment on the func like below and this is 👍
// consulCleanup removes the task from Consul
func ...
client/task_runner.go
Outdated
 // cleanup removes Consul entries and calls Driver.Cleanup when a task is
 // stopping. Errors are logged.
-func (r *TaskRunner) cleanup() {
+func (r *TaskRunner) consulCleanup() {
Can you remove this method, since it just wraps one line?
client/task_runner.go
Outdated
// cleanup calls Driver.Cleanup when a task is
// stopping. Errors are logged.
func (r *TaskRunner) cleanup() {
The RemoveTask call is idempotent so keep it here for the common cases.
client/task_runner.go
Outdated
// Remove from consul before killing the task so that traffic
// can be rerouted
r.consulCleanup()
Can you move this to below the `!running` check and then just call `r.consul.RemoveTask(r.alloc.ID, r.task)`?
client/task_runner.go
Outdated
@@ -918,6 +918,7 @@ func (r *TaskRunner) run() {
 	select {
 	case success := <-prestartResultCh:
 		if !success {
+			r.consulCleanup()
You can remove these and just call cleanup in all but the destroy case.
Force-pushed aac9964 to aa2da9e
Thanks for opening this PR based on our discussion in Gitter! As mentioned there, I'm finding that this change is necessary for doing things like zero-downtime rolling updates of pools of web servers, where the load balancer (like fabio or traefik) needs to see the Consul service status change and take the backend server out of the pool before it actually starts shutting down. I was planning to open a similar PR later, after we finished testing my fork, but I'm glad to see that this PR is well received.

One additional feature I've been testing is adding a short delay between deregistering from Consul and stopping the task. I may want to open a PR with that change on top of this one in the near future, though I see no reason to hold up this straightforward PR, since we'd likely want the delay to be configurable, which implies touching many more files than this one touches.
A quick question about this PR: is it possible to add a test for this change in behaviour? Because it is important to the proper functioning of our Nomad-based deployment solution, I want to make sure it doesn't accidentally regress if someone in the future thinks they are cleaning up the codebase by removing a "duplicate" call.
I'm not sure we want to add pauses in Nomad itself, since everyone's needs differ. You could add a sleep to your application's signal handler to continue accepting requests for a certain period of time.
Right, that's why my PR would make the amount of the delay configurable, and disabled by default.

Users may not always have this level of control over the application. For example, one might be using Nomad to run an off-the-shelf application; if that application doesn't include a configuration option for this (somewhat unusual) need, then the user would have to fork and run a patched version of it. Even when users write their own applications, they may not have direct control over the signal handler, as for web servers this is usually implemented in the web server library (or web framework) that the application uses. Changing this behaviour may or may not be possible without maintaining a forked/patched version of the relevant library. Maybe I'm wrong, but I don't think "wait N seconds after SIGINT before actually starting termination of the application" is likely to be a configuration option in very many off-the-shelf applications and libraries.

It does feel to me like this belongs at the level of the scheduler that is already carefully controlling all aspects of the rolling deployment. As long as it is configurable and disabled by default, it should be an unobtrusive feature of Nomad, and immensely useful to those like me who need it.
Hi @schmichael, I didn't do what you wanted because @dadgar's change seemed to remove the need for it. Are you happy with this as it stands?
@weargoggles - is it feasible to add a unit test for this before merging?
@jemc Sorry, my Go skills don't extend to assertions about event ordering.
Merged! Thanks @weargoggles! I can take it from here. @jemc Mind filing an issue with your use case? I think it's an interesting idea, but I want to make sure we come up with the best possible solution.
@schmichael - yep, I'll file a new issue ticket for the discussion.
I am confused by the plethora of comments.
Yes, in 0.7.1 and later the logic when stopping a task is: (1) Remove from Consul, (2) Sleep if there's a shutdown_delay configured, and (3) Stop task.

See here for relevant code in 0.7.1: https://github.com/hashicorp/nomad/blob/v0.7.1/client/task_runner.go#L1212-L1222
Great, thanks!

Is the deregister-from-Consul operation sent asynchronously? Basically, that means the shutdown delay is the only "protection" we have from the propagation time to deregister globally. Does it make sense to verify the record has been deregistered before proceeding with the kill request?
Good question: it is asynchronous. All Consul operations happen in an asynchronous loop to handle Consul outages without blocking other Nomad operations.
If you don't find the …
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
It seems as if consumers of a service would prefer to know as early as possible that it is going away. Currently the run loop waits until after the task has exited to even schedule the removal from Consul. This change causes the Consul client operations to be scheduled before the task is killed, with the intention of narrowing the window in which the task has exited but remains registered with Consul.