
Kill Allocations when client is disconnected from servers #2185

Closed
diptanu opened this issue Jan 11, 2017 · 15 comments · Fixed by #7939

Comments

@diptanu
Contributor

diptanu commented Jan 11, 2017

Nomad servers replace the allocations running on a node when the client misses heartbeats. The client may be partitioned from the servers, or it may simply be dead, but a disconnected client does not mean that its allocations are actually dead. This can be a problem when certain applications need exactly a fixed number of shards running.

Nomad will solve this by letting users configure a duration at the task group level after which the client kills that group's allocations once it is disconnected from the servers. In cases where the client itself is dead, drivers like exec or raw_exec, which use an executor to supervise the processes, will kill the process and exit.
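The fix that eventually closed this issue (#7939) appears to have added a group-level stop_after_client_disconnect setting. A minimal sketch of a job using it, with an illustrative duration and task:

job "example" {
  group "cache" {
    # Illustrative: have the client stop this group's allocations once it has
    # been disconnected from the servers for 30 seconds.
    stop_after_client_disconnect = "30s"

    task "server" {
      driver = "docker"
      config {
        image = "redis:3.2"
      }
    }
  }
}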

@diptanu diptanu added this to the v0.6.0 milestone Jan 11, 2017
@kak-tus

kak-tus commented Jan 12, 2017

Maybe #2184 and #2176 were caused by that problem?

@drscre

drscre commented Jan 12, 2017

@diptanu A related question:
To my understanding, the Nomad client currently restarts all tasks when it reconnects to the server after a connection loss, even if it was only a short network problem.
Is that by design? It seems more logical to just keep the tasks running after reconnecting.

@kak-tus

kak-tus commented Jan 12, 2017

Workaround until the 0.6 release:

Script in crontab:

#!/usr/bin/env sh

export NOMAD_ADDR=...

for file in ~/nomad/*; do
  nomad run "$file"
done

@dadgar
Contributor

dadgar commented Jan 13, 2017

@drscre Not quite. It depends on how long the connection loss lasts. Clients heartbeat to the servers every 15-45 seconds, depending on the size of the cluster. If a heartbeat is missed, the server marks that node as down and replaces its allocations. When the node comes back, it detects that it shouldn't be running those allocations and kills them.

If you lose and regain the connection within a heartbeat interval, nothing is restarted.
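For context, the server-side heartbeat behavior is tunable. A minimal sketch of the relevant settings in the server stanza, assuming they are available in your version (the values shown are, to the best of my knowledge, the defaults):

server {
  enabled = true

  # How long the server waits after a missed heartbeat before marking a node down.
  heartbeat_grace = "10s"

  # Lower bound on the heartbeat interval handed out to clients.
  min_heartbeat_ttl = "10s"

  # Caps how many heartbeats per second the servers will process; larger
  # clusters therefore get longer heartbeat intervals.
  max_heartbeats_per_second = 50.0
}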

@OferE

OferE commented Jan 13, 2017

I asked something similar and it was refused. Bad judgement IMHO.
Issue #2069
I had to work around this bug myself using my own raw_exec script.
Having containers running in the cluster without any control over them is a serious bug.
Imagine a cluster of 100 machines with containers running and no way for the admin to terminate them.

IMHO, whenever there is a lost connection, a killed agent, or any other Nomad issue that prevents the admin from terminating tasks remotely, the tasks on those nodes must kill themselves immediately.

On a 100-machine cluster these issues will happen on a daily basis.

@drscre

drscre commented Jan 14, 2017

@dadgar
Replacing allocations on just a single missed heartbeat looks very conservative. The client does not seem resilient to connection problems at all: the connection can disappear, or just lag for a very short time and, by bad luck, coincide with a heartbeat.

It would be great to have a configurable timeout for the cases where the exact number of running tasks is not important and the network is overloaded or laggy.

It also seems better for the Nomad server to kill an allocation not when the node is lost, but when the node comes back or after a configurable timeout.
I can clearly see the benefit in the single-node case :-). Nomad won't even have to kill the tasks when the node comes back quickly, because it will see that the tasks are already running.

@edwardbadboy

This would be really helpful for running virtual machines (qemu) in Nomad. Usually the virtual machine disk image is on shared storage, and we want to keep exactly one instance of a particular virtual machine in the whole cluster; otherwise two instances of a VM writing to the same disk image will cause data corruption. For now we have to use the exec driver with "consul lock ... qemu-kvm ..." to work around this problem.
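A minimal sketch of that workaround as a Nomad task using the exec driver; the lock prefix, disk image path, and qemu arguments here are illustrative:

task "vm" {
  driver = "exec"

  config {
    command = "consul"
    # "consul lock" holds a session-backed lock while the child runs; if this
    # node is partitioned long enough for the session to expire, the child
    # qemu process is killed, so at most one writer touches the shared image.
    args = [
      "lock", "locks/my-vm",
      "qemu-kvm", "-drive", "file=/shared/my-vm.qcow2",
    ]
  }

  resources {
    cpu    = 1000
    memory = 2048
  }
}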

@schmichael schmichael removed this from the v0.6.0 milestone Jul 31, 2017
@jfvubiquity

Is there any setting for the Nomad client (or will there be in the future) that
allows the client to kill all allocations after being disconnected from the servers for a certain amount of time?
I know it will kill all containers when reconnected, but it would be nice if the client could
kill everything and do the cleanup without waiting for a reconnection.

@dadgar
Contributor

dadgar commented Aug 21, 2017

@jfvubiquity there currently isn't. This would be the issue to watch for that feature.

@edwardbadboy

edwardbadboy commented Aug 29, 2017

I think there may be two types of workloads: "at least N copies of an instance" and "exactly N copies of an instance". Being able to kill allocations matters for the second use case.

I think it would be helpful for Nomad to provide a semaphore-like resource declaration to solve this problem. For example,

job "example" {
  group "example" {
    task "server" {
      resources {
        cpu    = 100
        memory = 256

        semaphore {
            name = 'xxx'
            slots = 3
            consume = 1
            lease_timeout = "60s"
        }

        network {
          mbits = 100
          port "http" {}
          port "ssh" {
            static = 22
          }
        }
      }
    }
  }
}

So Nomad could use Raft to implement a semaphore, and each instance of the task would consume one slot. When the client has been offline for 60s, it loses the semaphore and kills the allocation, and Nomad can then create a new instance on another node. Another way would be to integrate the Consul lock interface into the job specification.

@alxark

alxark commented Aug 29, 2017

@edwardbadboy your idea is just great. Currently I have to implement locks manually via a MongoDB database (I am going to replace that with Consul locks now), but Nomad locking might solve a lot of my problems and reduce system complexity.

@rmlsun

rmlsun commented Jan 16, 2019

I can see the benefit of the proposed changes in this thread and I would like to have them too. That said, I'd like to be able to completely opt out of such timeout/cleanup behavior.

From a failure-handling point of view, if for whatever reason the Nomad clients and servers lose contact for some period of time, I'd like Nomad to NOT wipe out my infrastructure services, and to offer me the chance to recover from the networking/Nomad outage without fighting all the other infra fires at the same time.

This is one of the things I tested and like about Nomad in my destructive testing cases.

On a side note, DC/OS and k8s have similar behavior of NOT wiping out existing runtimes in such cases, at least for the versions that we tested and are running in our infra. A couple of times, DC/OS not wiping out all existing runtimes upon complete master-node failure gave us the relief of only fighting the DC/OS fire while the existing services running on the broken DC/OS cluster remained unaffected.

@tgross
Member

tgross commented Jan 31, 2020

Appending some notes to this issue for CSI support:

The unpublishing workflow for CSI assumes the cooperation of a healthy client that can report that allocs are terminal. On the client side we'll help reconcile this by having a client that has been out of contact with the server for too long mark its allocs as terminal, which will call NodeUnstageVolume and NodeUnpublishVolume. (aka "heart yeet")

@tgross
Member

tgross commented Apr 8, 2020

@langmartin while we're working through the design for this, we should consider the cases of #6212 and #7607 as well.

@github-actions

github-actions bot commented Nov 7, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 7, 2022