
Improvement: flag to wait for node to be drained #2736

Closed
pznamensky opened this issue Jun 24, 2017 · 18 comments

@pznamensky
I guess it would be useful not only for us:
what do you think about adding a flag like -wait-for-drain that would make nomad node-drain exit only once no allocations remain on the node?
Something like:
nomad node-drain -enable -self -wait-for-drain
It would be useful as a part of start/stop scripts at a pre stop step:

  • wait until node is drained
  • kill nomad agent
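A rough sketch of what such a pre-stop step could look like with today's CLI (my own illustration, not anything Nomad ships; `count_running` and `drain_and_wait` are hypothetical helper names, and the exact column layout of `nomad node-status` output is assumed):

```shell
#!/usr/bin/env bash
# Illustrative approximation of the proposed -wait-for-drain behaviour:
# enable the drain, then block until the node reports no running allocations.
# Assumes `nomad` is on PATH and allocation states appear as the word
# "running" in `nomad node-status -self` output.

count_running() {
  # Count stdin lines whose allocation status column reads "running"
  grep -c ' running' || true
}

drain_and_wait() {
  nomad node-drain -enable -self -yes
  while [ "$(nomad node-status -self | count_running)" -gt 0 ]; do
    sleep 5
  done
}
```

A systemd ExecStop= could then call `drain_and_wait` before the agent is killed.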
@jippi
Contributor

jippi commented Jun 24, 2017

Would be super nice indeed

@rickardrosen

That is the behaviour I'd actually expect when draining a cluster node.
I'd like to see it as the default drain functionality, with an option to stop running jobs right away.

Nomad surprised me the first time I drained a node: it stopped all jobs and re-allocated them.
I had missed that part of the fine manual and assumed things. :)

@dvusboy

dvusboy commented Jun 25, 2017

@pznamensky What do you mean by "terminate node-drain"? Do you mean automatically disable node-drain or do you mean terminate the client altogether? The bullet seems to suggest the latter, but "terminate node-drain" suggests the former.
My typical use of node-drain is for a system update that requires a reboot. I'd enable node-drain, wait for the drainage, and then reboot the host. When the host comes back, and everything looks OK, I disable the node-drain state. In this case, I don't see how -wait-for-drain would help.

Terminating the client doesn't seem to add much value either, e.g. in the case of upgrading Nomad itself: you still need to install the new version, and a simple restart after the new version is installed seems to suffice. Do you expect the client to come back with node-drain automatically disabled? I don't think that's a good idea, as you may want to verify that the new version of Nomad, or the rest of the system, runs well before disabling node-drain.
@rickardrosen Do you expect the drained tasks to just die? Or do you expect the client just not to accept new allocation but let the tasks finish normally? Like #1523?

@rickardrosen

I'd like to be able to tell a node not to take any new task allocations.
Subsequent updates to jobs with allocations on this draining node would re-allocate those tasks to other hosts.

As almost all our services are deployed continuously, within a couple of days the node set to drain would be empty and ready for maintenance work.

@dvusboy

dvusboy commented Jun 25, 2017

What about existing allocations on the draining node? For long-running services, don't you need to terminate them?

@rickardrosen

No, no action except moving allocations on job update/scheduling.
As jobs are updated frequently the draining node would normally be empty in a couple of days.

Say there are a couple of tasks left as a maintenance window approaches; we'd just re-schedule those jobs to empty the node and complete the drain.

Personally I would like to see:

  1. "node-drain": the node takes no further allocations, and tasks are re-allocated only on updates to jobs with tasks on the draining node.

  2. "node-drain -stop": pull the plug right away and re-allocate tasks immediately, i.e. the current behaviour.

That's more in line with my experience of resource/connection "brokering" products like load balancers and such.

Does it make sense? :)

@jippi
Contributor

jippi commented Jun 26, 2017

I've done something over in seatgeek/nomad-helper that is similar to what's being asked here.

The code basically waits for all allocations on the node to be in a state that isn't running or pending before unblocking :)

Binaries are available for Linux, Darwin, and Windows.
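For reference, that unblocking condition can be sketched in shell (this is my reading of the description above, not nomad-helper's actual Go source; `node_quiet` is a hypothetical name):

```shell
# Treat the node as drained only when no allocation line on stdin is in the
# "running" or "pending" state, mirroring the condition described above.
node_quiet() {
  ! grep -Eq ' (running|pending)'
}

# usage sketch: nomad node-status -self | node_quiet && echo "drained"
```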

@pznamensky
Author

@dvusboy

What do you mean by "terminate node-drain"? Do you mean automatically disable node-drain or do you mean terminate the client altogether?

I mean: enable drainage and keep the command nomad node-drain -enable -wait-for-drain running as long as there are any running allocations on the node.
For now, nomad node-drain only enables drainage, and if I want to perform maintenance on a node I have to poll nomad node-status -allocs with an external script.
It would be nice to have such a flag in nomad itself.

@jippi

I've done something over in seatgeek/nomad-helper that is similar to what's being asked here.

Great! Thanks!
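As an aside, that external polling doesn't have to scrape CLI output: the agent's HTTP API exposes a node's allocations at /v1/node/:node_id/allocations. A minimal sketch (`count_running_api` is a hypothetical helper; it naively greps the JSON rather than parsing it properly):

```shell
# Count allocations whose ClientStatus is "running" in the JSON returned by
# GET /v1/node/:node_id/allocations. Naive string matching, not real JSON
# parsing; good enough for a quick drain-wait loop.
count_running_api() {
  grep -o '"ClientStatus":"running"' | wc -l | tr -d ' '
}

# usage sketch (NODE_ID must be set; assumes a local agent on the default port):
# curl -s "http://127.0.0.1:4646/v1/node/${NODE_ID}/allocations" | count_running_api
```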

@burdandrei
Contributor

I'm handling it like this:

  • here is the systemd unit:
[Unit]
Description=nomad agent
Requires=network-online.target
After=network-online.target

[Service]
LimitNOFILE=65536
EnvironmentFile=-/etc/default/nomad
Environment=GOMAXPROCS=2
Restart=on-failure
ExecStart=/usr/local/bin/nomad agent $OPTIONS -config /etc/nomad.d
ExecStartPost=/bin/sleep 2
ExecStartPost=/usr/local/bin/nomad node-drain -self -disable -yes
ExecReload=/bin/kill -HUP $MAINPID
ExecStop=/usr/local/bin/nomad node-drain -self -enable -yes
ExecStop=/usr/local/bin/nomad-drain-wait
KillSignal=SIGINT

[Install]
WantedBy=multi-user.target

and here is the nomad-drain-wait:

#!/bin/bash
#
# Help systemd know that nomad is drained

set -e

while nomad node-status -self | grep -q running; do
  echo "Tasks are still running:"
  nomad node-status -self | grep running | awk '{print $3"-"$4}'
  sleep 1
done

@blalor
Contributor

blalor commented Jun 29, 2017

Mine's a little more … ah … complicated.

https://gist.github.com/blalor/246eaf5755e784b353ab756a36a1142e

I make a best effort to ensure that allocations running on the local node are started up elsewhere, so that any ephemeral_disk can be migrated.

Even with all of this jiggery-pokery, systemd still doesn't always shut Nomad down cleanly when stopping the instance.

@blalor
Contributor

blalor commented Jun 29, 2017

Also, this is basically the same thing as #2052.

@jippi
Contributor

jippi commented Jun 29, 2017

@blalor that's a lot of code! :P Feel free to use my Go binary instead :)

@blalor
Contributor

blalor commented Jun 29, 2017

@jippi It is, but it works, and I wrote it 7 months ago… :-) Yours doesn't wait for evals triggered by the node draining to complete, and that's an important piece of making sure the ephemeral_disk is migrated.

@jippi
Contributor

jippi commented Jun 29, 2017

@blalor I was under the impression that once the allocation is in stop mode, all migrations have completed too: that DesiredStatus = stop transitioning to Status = stopped would mark the migration as complete as well.

@blalor
Contributor

blalor commented Jun 29, 2017

My experience is that that is not the case: the old allocation becomes stopped, but the new one stays pending until the data is migrated. [edit] as of 0.5.6.
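Under the semantics discussed in the last few comments (assumed, not verified against Nomad's source), a stricter drain check would look at both sides of each migration rather than only the old allocation; `alloc_migrated` below is a hypothetical illustration:

```shell
# An old/new allocation pair counts as fully migrated only when the old
# alloc has finished AND its replacement is no longer pending (i.e. the
# ephemeral_disk data has been moved and the new alloc has started).
alloc_migrated() {
  old_status="$1"   # ClientStatus of the drained allocation
  new_status="$2"   # ClientStatus of its replacement
  [ "$old_status" = "complete" ] && [ "$new_status" != "pending" ]
}
```

A wrapper would call this for every allocation on the draining node and unblock only when all pairs pass.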

@shantanugadgil
Contributor

This would be very useful for my use case as well.
I need to mark a machine for drain, let no new allocations go there, but let the ongoing workload finish.

I think a workaround using nomad-helper as mentioned by @jippi will be ok for now.

@sprutner
Contributor

@jippi I've also noticed that it does not wait for the new allocations to hit the run state. The new allocations sit in pending while they pull from Docker Hub, and there is a period where the old allocs are stopped before the new ones come up.

@github-actions

github-actions bot commented Dec 1, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
