Template: Allow rolling restart for restart change_mode. #2202

Closed · cyrilgdn opened this issue Jan 16, 2017 · 12 comments · Fixed by #2227

Comments

cyrilgdn commented Jan 16, 2017

Nomad version

Nomad v0.5.2

Issue

We started using the template stanza with Consul.
It works well, but the restart change_mode causes significant downtime, even with multiple instances of the task.

This is especially bad with the Docker driver, which removes the Docker image and downloads it again during the restart (this seems to be linked to #1530).

It would be great if the restarts could (optionally) be done more smoothly, ideally based on the update strategy.

What do you think? Did I miss something?

Thank you!

dadgar (Contributor) commented Jan 17, 2017

@cyrilgdn Have you all set the splay to something larger than the default value? That can be used to avoid the thundering herd behavior you are experiencing!

cyrilgdn (Author)

@dadgar I've already tried using splay, but with multiple instances, all instances of the task still restart at the same time.

Here is my test Job:

job "template-test" {
    datacenters = ["dc1"]

    type = "batch"

    group "template-test" {
        count = 2
        task "template-test" {
            driver = "exec"

            config {
                command = "sh"
                args = ["-c", "sleep 5000; cat local/test.conf; exit 0"]
            }

            template {
                destination = "local/test.conf"
                data        = "{{ key \"configtest\" }}"
                # Wait a random duration, up to 2 minutes, before
                # restarting when the rendered template changes.
                splay       = "2m"
            }
        }
    }
}

In this case, when the value changes in Consul, Nomad waits for a random time (less than 2 minutes) and then restarts both instances at the same time.
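
For reference, splay behaves like an independent random delay per instance: each instance draws its own wait in [0, splay), so two instances can still land on nearly the same restart time. Here is a minimal sketch of that idea in Go (illustrative only, not consul-template's actual code; splayWait is a made-up name for the example):

package main

import (
	"fmt"
	"math/rand"
	"time"
)

// splayWait picks a uniformly random delay in [0, splay).
// Each instance draws independently, so collisions remain possible.
func splayWait(splay time.Duration) time.Duration {
	if splay <= 0 {
		return 0
	}
	return time.Duration(rand.Int63n(int64(splay)))
}

func main() {
	wait := splayWait(2 * time.Minute)
	fmt.Printf("sleeping %s before restart\n", wait)
	time.Sleep(wait)
	// ...restart the task here...
}

Note that even with splay applied correctly, the draws are independent, so simultaneous restarts remain possible, just less likely.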

akaspin commented Jan 20, 2017

@cyrilgdn This issue is about rolling restarts, which means restarting instances one by one, not at random times.

cyrilgdn (Author)

@akaspin Yes, I know (given that I created the issue :)), that's what I'd like to do.

I only tried the splay option as an alternative way to avoid downtime.

akaspin commented Jan 22, 2017

OK. I finally implemented a solution (https://hub.docker.com/r/akaspin/docker-backstab/). It is designed for Docker.

Design (a rough sketch of the locking idea follows the list):

  1. Backstab reacts to changes in a provided Consul "trigger" template.
  2. When the trigger template changes, backstab acquires a lock in Consul and restarts the managed container.
  3. After the restart, backstab may wait for some time and/or wait for the managed container's health check.
  4. After the wait, backstab releases the lock.

For now, this implementation has been tested on a CoreOS cluster with Nomad, where I'm running Mesos.
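
A minimal sketch of the lock-then-restart step, using the official Consul Go client (github.com/hashicorp/consul/api); the lock key and the docker restart target are illustrative assumptions, not backstab's actual code:

package main

import (
	"log"
	"os/exec"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Connect to the local Consul agent.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Only one holder of this lock restarts at a time, so instances
	// sharing the key restart one by one. The key is a placeholder.
	lock, err := client.LockKey("service/template-test/restart-lock")
	if err != nil {
		log.Fatal(err)
	}
	if _, err := lock.Lock(nil); err != nil {
		log.Fatal(err)
	}
	defer lock.Unlock()

	// Restart the managed container while holding the lock; the
	// container name is a placeholder.
	if err := exec.Command("docker", "restart", "my-container").Run(); err != nil {
		log.Fatal(err)
	}

	// Optionally sleep or poll a health check here before the lock
	// is released on return.
}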

dadgar (Contributor) commented Jan 22, 2017

@cyrilgdn You didn't miss something, I did 😄 The splay wasn't being applied correctly. It's fixed and will be in 0.5.3.

cyrilgdn (Author)

@dadgar Thanks for fixing the splay option!

But to me, the question raised in this issue remains valid.

In my case, if I have 2 instances of the same task (count = 2), I can set a large splay hoping that both will not restart at the same time, but this is not guaranteed.
It would be great to have a rolling restart option that guarantees, like a job upgrade, that all instances are restarted sequentially, to avoid any downtime.

dadgar (Contributor) commented Jan 31, 2017

@cyrilgdn Unfortunately, we will not be doing coordinated restarts on template changes. If this is a requirement for you, I would suggest building some external tooling to manage the restarts.

cyrilgdn (Author) commented Feb 1, 2017

@dadgar Thanks for your answer!

"Not be doing", as in never?

Maybe I missed or misunderstood something (we started using Nomad a few weeks ago and have only just tested the template stanza).
Am I the only one with this kind of problem? Does every other user run an external tool to avoid downtime on configuration changes?

multani (Contributor) commented Feb 1, 2017

@dadgar Wouldn't it be possible to reuse the restart {} stanza when the template's change_mode is set to restart?

It's probably possible to do something with external tooling, but restart in its current form is very basic compared to what's provided at the group level with the restart {} stanza, for example. I would even dare to say that it's actually too limited and will confuse users:

  • if change_mode = "restart" and splay is too low:
    • a task could be restarted while another task is still starting
    • all tasks could be restarted at the same time
    • in either case, this will most probably cause downtime
  • the only way to mitigate this is to increase splay enough that the probability of all tasks restarting at the same moment diminishes, but:
    • that makes the template stanza less attractive, as the time it takes for a task to pick up the new configuration may be as high as this (higher) splay value
    • it mitigates the problem we have with a lower splay value, but the problem is still there, and with bad luck we may still end up with all tasks restarting at the same time.

All in all, it seems that this defeats the purpose of having the template stanza in the first place, if the way to actually control what's happening is not to use it and instead to package Consul Template or some other external tooling inside the task itself.
Especially since Nomad already supports coordinated restarts, which, if I understand correctly, will get even better in the future.

dadgar (Contributor) commented Feb 1, 2017

In its current form, we cannot support that type of behavior. The plugin executes locally on every client, and there is no coordination between clients. I have referenced a GitHub issue on consul-template that could solve this.

If consul-template supported locking, then we could limit parallelism.
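
For illustration, here is one way such coordination could look, sketched with the official Consul Go client's semaphore (github.com/hashicorp/consul/api); the key prefix and the limit of 1 are assumptions for the example, not an existing consul-template feature:

package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// A semaphore with limit 1 allows at most one holder, so
	// restarts across clients proceed one at a time; raising the
	// limit would allow bounded parallelism. The prefix is an
	// illustrative placeholder.
	sema, err := client.SemaphorePrefix("service/template-test/restart-sema", 1)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := sema.Acquire(nil); err != nil {
		log.Fatal(err)
	}
	defer sema.Release()

	// ...restart this instance and wait for it to become healthy
	// before the semaphore is released on return...
}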

github-actions (bot)

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 16, 2022