
Kill Allocations when client is disconnected from servers #2185

Closed
diptanu opened this issue Jan 11, 2017 · 15 comments · Fixed by #7939

Comments

@diptanu
Contributor

diptanu commented Jan 11, 2017

Nomad servers replace the allocations running on a node when the client misses heartbeats. The client may be partitioned from the servers, or it may simply be dead, but a disconnected client does not mean that its allocations are actually dead. This can be a problem when certain applications need exactly a fixed number of shards running.

Nomad will solve this by letting users configure a duration at the task group level after which the client kills that group's allocations once it is disconnected from the servers. In cases where the client itself is dead, drivers like exec or raw_exec, which use an executor to supervise the processes, will kill the process and exit.
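The fix that eventually closed this issue (#7939) appears to have added a group-level stop_after_client_disconnect setting. A minimal sketch of a job using it, with an illustrative duration and task:

job "example" {
  group "cache" {
    # Illustrative: have the client stop this group's allocations once it has
    # been disconnected from the servers for 30 seconds.
    stop_after_client_disconnect = "30s"

    task "server" {
      driver = "docker"
      config {
        image = "redis:3.2"
      }
    }
  }
}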

@diptanu diptanu added this to the v0.6.0 milestone Jan 11, 2017
@kak-tus

kak-tus commented Jan 12, 2017

Maybe #2184 and #2176 were caused by that problem?

@drscre

drscre commented Jan 12, 2017

@diptanu A related question:
To my understanding, the Nomad client currently restarts all tasks when it reconnects to the server after a connection loss, even if it was only a short network problem.
Is that by design? It seems more logical to just keep the tasks running after reconnecting.

@kak-tus

kak-tus commented Jan 12, 2017

Workaround until the 0.6 release:

Script in crontab:

#!/usr/bin/env sh

export NOMAD_ADDR=...

for file in ~/nomad/*; do
  nomad run "$file"
done

@dadgar
Contributor

dadgar commented Jan 13, 2017

@drscre Not quite. It depends on how long the connection loss lasts. Clients heartbeat to the servers every 15-45 seconds, depending on the size of the cluster. If a heartbeat is missed, the server marks that node as down and replaces its allocations. When the node comes back, it detects that it shouldn't be running those allocations and kills them.

If you lose and regain the connection within a heartbeat interval, nothing is restarted.
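For context, the server-side heartbeat behavior is tunable. A minimal sketch of the relevant settings in the server stanza, assuming they are available in your version (the values shown are, to the best of my knowledge, the defaults):

server {
  enabled = true

  # How long the server waits after a missed heartbeat before marking a node down.
  heartbeat_grace = "10s"

  # Lower bound on the heartbeat interval handed out to clients.
  min_heartbeat_ttl = "10s"

  # Caps how many heartbeats per second the servers will process; larger
  # clusters therefore get longer heartbeat intervals.
  max_heartbeats_per_second = 50.0
}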

@OferE

OferE commented Jan 13, 2017

I asked something similar and it was refused. Bad judgement IMHO.
Issue #2069
I had to work around this bug myself using my own raw_exec script.
Having containers running in the cluster without any control over them is a serious bug.
Imagine a cluster of 100 machines with containers running and no way for the admin to terminate them.

IMHO, whenever there is a lost connection, a killed agent, or any other Nomad issue that prevents the admin from terminating tasks remotely, the tasks on those nodes must kill themselves immediately.

On a 100-machine cluster these issues will happen on a daily basis.

@drscre

drscre commented Jan 14, 2017

@dadgar
Replacing allocations on just a single missed heartbeat looks very conservative. The client does not seem resilient to connection problems at all: the connection can disappear, or just lag for a very short time and, by bad luck, coincide with a heartbeat.

It would be great to have a configurable timeout for the cases where the exact number of running tasks is not important and the network is overloaded or laggy.

It also seems better for the Nomad server to kill an allocation not when the node is lost, but when the node comes back or after a configurable timeout.
I can clearly see the benefit in the single-node case :-). Nomad won't even have to kill the tasks when the node comes back quickly, because it will see that the tasks are already running.

@edwardbadboy

This would be really helpful for running virtual machines (qemu) in Nomad. Usually the virtual machine disk image is on shared storage, and we want to keep exactly one instance of a particular virtual machine in the whole cluster; otherwise two instances of a VM writing to the same disk image will cause data corruption. For now we have to use the exec driver with "consul lock ... qemu-kvm ..." to work around this problem.
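A minimal sketch of that workaround as a Nomad task using the exec driver; the lock prefix, disk image path, and qemu arguments here are illustrative:

task "vm" {
  driver = "exec"

  config {
    command = "consul"
    # "consul lock" holds a session-backed lock while the child runs; if this
    # node is partitioned long enough for the session to expire, the child
    # qemu process is killed, so at most one writer touches the shared image.
    args = [
      "lock", "locks/my-vm",
      "qemu-kvm", "-drive", "file=/shared/my-vm.qcow2",
    ]
  }

  resources {
    cpu    = 1000
    memory = 2048
  }
}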

@schmichael schmichael removed this from the v0.6.0 milestone Jul 31, 2017
@jfvubiquity

Is there any setting for the Nomad client (or will there be in the future) that
allows the client to kill all allocations after being disconnected from the servers for a certain amount of time?
I know it will kill all containers when reconnected, but it would be nice if the client could
kill everything and do the cleanup without waiting for a reconnection.

@dadgar
Contributor

dadgar commented Aug 21, 2017

@jfvubiquity there currently isn't. This would be the issue to watch for that feature.

@edwardbadboy

edwardbadboy commented Aug 29, 2017

I think there may be two types of workloads: "at least N copies of an instance" and "exactly N copies of an instance". Being able to kill allocations matters for the second use case.

I think it would be helpful for Nomad to provide a semaphore-like resource declaration to solve this problem. For example,

job "example" {
  group "example" {
    task "server" {
      resources {
        cpu    = 100
        memory = 256

        semaphore {
            name = 'xxx'
            slots = 3
            consume = 1
            lease_timeout = "60s"
        }

        network {
          mbits = 100
          port "http" {}
          port "ssh" {
            static = 22
          }
        }
      }
    }
  }
}

So Nomad could use Raft to implement a semaphore, and each instance of the task would consume one slot. When the client has been offline for 60s, it loses the semaphore and kills the allocation, and Nomad can then create a new instance on another node. Another way would be to integrate the Consul lock interface into the job specification.

@alxark

alxark commented Aug 29, 2017

@edwardbadboy your idea is just great. Currently I have to implement locks manually via a MongoDB database (I am going to replace that with Consul locks now), but Nomad locking might solve a lot of my problems and reduce system complexity.

@rmlsun

rmlsun commented Jan 16, 2019

I can see the benefit of the proposed changes in this thread and I would like to have them too. That said, I'd like to be able to completely opt out of such timeout/cleanup behavior.

From a failure-handling point of view, if for whatever reason the Nomad clients and servers lose contact for some period of time, I'd like Nomad to NOT wipe out my infrastructure services, and to offer me the chance to recover from the networking/Nomad outage without fighting all the other infra fires at the same time.

This is one of the things I tested and like about Nomad in my destructive testing cases.

On a side note, DC/OS and k8s have similar behavior of NOT wiping out existing runtimes in such cases, at least for the versions that we tested and are running in our infra. A couple of times, DC/OS not wiping out all existing runtimes upon complete master-node failure gave us the relief of only fighting the DC/OS fire while the existing services running on the broken DC/OS cluster remained unaffected.

@tgross
Member

tgross commented Jan 31, 2020

Appending some notes to this issue for CSI support:

The unpublishing workflow for CSI assumes the cooperation of a healthy client that can report that allocs are terminal. On the client side we'll help reconcile this by having a client that has been out of contact with the server for too long mark its allocs as terminal, which will call NodeUnstageVolume and NodeUnpublishVolume. (aka "heart yeet")

@tgross
Member

tgross commented Apr 8, 2020

@langmartin while we're working through the design for this, we should consider the cases of #6212 and #7607 as well.

@github-actions

github-actions bot commented Nov 7, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 7, 2022