
Improvement: flag to wait for node to be drained #2736

Closed
pznamensky opened this issue Jun 24, 2017 · 18 comments

@pznamensky
I guess it would be useful not only for us:
what do you think about adding a flag like -wait-for-drain that would make nomad node-drain exit only once no allocations remain on the node?
Something like:
nomad node-drain -enable -self -wait-for-drain
It would be useful as a part of start/stop scripts at a pre stop step:

  • wait until node is drained
  • kill nomad agent
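A rough sketch of what such a pre-stop step could look like with today's CLI (my own illustration, not anything Nomad ships; `count_running` and `drain_and_wait` are hypothetical helper names, and the exact column layout of `nomad node-status` output is assumed):

```shell
#!/usr/bin/env bash
# Illustrative approximation of the proposed -wait-for-drain behaviour:
# enable the drain, then block until the node reports no running allocations.
# Assumes `nomad` is on PATH and allocation states appear as the word
# "running" in `nomad node-status -self` output.

count_running() {
  # Count stdin lines whose allocation status column reads "running"
  grep -c ' running' || true
}

drain_and_wait() {
  nomad node-drain -enable -self -yes
  while [ "$(nomad node-status -self | count_running)" -gt 0 ]; do
    sleep 5
  done
}
```

A systemd ExecStop= could then call `drain_and_wait` before the agent is killed.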
@jippi
Contributor

jippi commented Jun 24, 2017

Would be super nice indeed

@rickardrosen

That is the behaviour I'd actually expect when draining a cluster node.
I'd like to see it as the default drain functionality, with an option to stop running jobs right away.

Nomad surprised me the first time I drained a node: it stopped all jobs and re-allocated them.
I had missed that part of the fine manual and assumed things. :)

@dvusboy

dvusboy commented Jun 25, 2017

@pznamensky What do you mean by "terminate node-drain"? Do you mean automatically disable node-drain or do you mean terminate the client altogether? The bullet seems to suggest the latter, but "terminate node-drain" suggests the former.
My typical use of node-drain is for a system update that requires a reboot. I'd enable node-drain, wait for the drainage, and then reboot the host. When the host comes back, and everything looks OK, I disable the node-drain state. In this case, I don't see how -wait-for-drain would help.

Terminating the client doesn't seem to add much value either, e.g. in the case of upgrading Nomad itself: you still need to install the new version, and a simple restart after the new version is installed seems to suffice. Do you expect the client to come back with node-drain automatically disabled? I don't think that's a good idea, as you may want to verify that the new version of Nomad, or the rest of the system, runs well before disabling node-drain.
@rickardrosen Do you expect the drained tasks to just die? Or do you expect the client just not to accept new allocation but let the tasks finish normally? Like #1523?

@rickardrosen

I'd like to be able to tell a node not to take any new task allocations.
Subsequent updates to jobs with allocations on this draining node would re-allocate those tasks to other hosts.

As almost all our services are deployed continuously, within a couple of days the node set to drain would be empty and ready for maintenance work.

@dvusboy

dvusboy commented Jun 25, 2017

What about existing allocations on the draining node? For long-running services, don't you need to terminate them?

@rickardrosen

No, no action except moving allocations on job update/scheduling.
As jobs are updated frequently the draining node would normally be empty in a couple of days.

Say there are a couple of tasks left as a maintenance window approaches; we'd just re-schedule those jobs to empty the node and complete the drain.

Personally I would like to see:

  1. "node-drain": the node takes no further allocations, and tasks are re-allocated only on updates to jobs with tasks on the draining node.

  2. "node-drain -stop": pull the plug right away and re-allocate tasks immediately, i.e. the current behaviour.

That's more in line with my experience of resource/connection "brokering" products like load balancers and such.

Does it make sense? :)

@jippi
Contributor

jippi commented Jun 26, 2017

I've done something over in seatgeek/nomad-helper that is similar to what's being asked here.

The code basically waits for all allocations on the node to be in a state that isn't running or pending before unblocking :)

Binaries are available for Linux, Darwin, and Windows.
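For reference, that unblocking condition can be sketched in shell (this is my reading of the description above, not nomad-helper's actual Go source; `node_quiet` is a hypothetical name):

```shell
# Treat the node as drained only when no allocation line on stdin is in the
# "running" or "pending" state, mirroring the condition described above.
node_quiet() {
  ! grep -Eq ' (running|pending)'
}

# usage sketch: nomad node-status -self | node_quiet && echo "drained"
```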

@pznamensky
Author

@dvusboy

What do you mean by "terminate node-drain"? Do you mean automatically disable node-drain or do you mean terminate the client altogether?

I mean: enable drainage and keep the command nomad node-drain -enable -wait-for-drain running as long as there are any running allocations on the node.
For now, nomad node-drain only enables drainage, and if I want to perform maintenance on a node I have to poll nomad node-status -allocs with an external script.
It would be nice to have such a flag in nomad itself.

@jippi

I've done something over in seatgeek/nomad-helper that is similar to what's being asked here.

Great! Thanks!
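As an aside, that external polling doesn't have to scrape CLI output: the agent's HTTP API exposes a node's allocations at /v1/node/:node_id/allocations. A minimal sketch (`count_running_api` is a hypothetical helper; it naively greps the JSON rather than parsing it properly):

```shell
# Count allocations whose ClientStatus is "running" in the JSON returned by
# GET /v1/node/:node_id/allocations. Naive string matching, not real JSON
# parsing; good enough for a quick drain-wait loop.
count_running_api() {
  grep -o '"ClientStatus":"running"' | wc -l | tr -d ' '
}

# usage sketch (NODE_ID must be set; assumes a local agent on the default port):
# curl -s "http://127.0.0.1:4646/v1/node/${NODE_ID}/allocations" | count_running_api
```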

@burdandrei
Contributor

I'm handling it like this:

  • here is the systemd unit:
[Unit]
Description=nomad agent
Requires=network-online.target
After=network-online.target

[Service]
LimitNOFILE=65536
EnvironmentFile=-/etc/default/nomad
Environment=GOMAXPROCS=2
Restart=on-failure
ExecStart=/usr/local/bin/nomad agent $OPTIONS -config /etc/nomad.d
ExecStartPost=/bin/sleep 2
ExecStartPost=/usr/local/bin/nomad node-drain -self -disable -yes
ExecReload=/bin/kill -HUP $MAINPID
ExecStop=/usr/local/bin/nomad node-drain -self -enable -yes
ExecStop=/usr/local/bin/nomad-drain-wait
KillSignal=SIGINT

[Install]
WantedBy=multi-user.target

and here is the nomad-drain-wait:

#!/bin/bash
#
# Help systemd know that nomad is drained

set -e

while nomad node-status -self | grep -q running; do
  echo "Tasks are still running:"
  nomad node-status -self | grep running | awk '{print $3"-"$4}'
  sleep 1
done

@blalor
Contributor

blalor commented Jun 29, 2017

Mine's a little more … ah … complicated.

https://gist.github.com/blalor/246eaf5755e784b353ab756a36a1142e

I make a best effort to ensure that allocations running on the local node are started up elsewhere, so that any ephemeral_disk can be migrated.

Even with all of this jiggery-pokery, systemd still doesn't always shut Nomad down cleanly when stopping the instance.

@blalor
Contributor

blalor commented Jun 29, 2017

Also, this is basically the same thing as #2052.

@jippi
Contributor

jippi commented Jun 29, 2017

@blalor that's a lot of code! :P Feel free to use my Go binary instead :)

@blalor
Contributor

blalor commented Jun 29, 2017

@jippi It is, but it works, and I wrote it 7 months ago… :-) Yours doesn't wait for evals triggered by the node draining to complete, and that's an important piece of making sure the ephemeral_disk is migrated.

@jippi
Contributor

jippi commented Jun 29, 2017

@blalor I was under the impression that once the allocation is in stop mode, all migrations have completed too: that DesiredStatus = stop transitioning to Status = stopped would mark the migration as complete as well.

@blalor
Contributor

blalor commented Jun 29, 2017

My experience is that that is not the case: the old allocation becomes stopped, but the new one stays pending until the data is migrated. [edit] as of 0.5.6.
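Under the semantics discussed in the last few comments (assumed, not verified against Nomad's source), a stricter drain check would look at both sides of each migration rather than only the old allocation; `alloc_migrated` below is a hypothetical illustration:

```shell
# An old/new allocation pair counts as fully migrated only when the old
# alloc has finished AND its replacement is no longer pending (i.e. the
# ephemeral_disk data has been moved and the new alloc has started).
alloc_migrated() {
  old_status="$1"   # ClientStatus of the drained allocation
  new_status="$2"   # ClientStatus of its replacement
  [ "$old_status" = "complete" ] && [ "$new_status" != "pending" ]
}
```

A wrapper would call this for every allocation on the draining node and unblock only when all pairs pass.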

@shantanugadgil
Contributor

This would be very useful for my use case as well.
I need to mark a machine for drain, let no new allocations go there, but let the ongoing workload finish.

I think a workaround using nomad-helper as mentioned by @jippi will be ok for now.

@sprutner
Contributor

@jippi I've also noticed that it does not wait for the new allocations to hit the run state. The new allocations sit in pending while they pull from Docker Hub, and there is a period where the old allocs are stopped before the new ones come up.

@github-actions

github-actions bot commented Dec 1, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
