Add ability to restart all running tasks/allocs of a job #698

Closed
supernomad opened this issue Jan 22, 2016 · 64 comments · Fixed by #16278
Labels
stage/accepted Confirmed, and intend to work on. No timeline commitment though. theme/api HTTP API and SDK issues type/enhancement

Comments

@supernomad

So I would love the ability to restart tasks: at the very least an entire job, but preferably single allocations. This is very useful when a particular allocation or job gets into a bad state.

I am thinking something like nomad restart <job> or nomad alloc-restart <alloc-id>.

One of my specific use cases: I have a cluster of RabbitMQ nodes, and at some point one of the nodes gets partitioned from the rest of the cluster. I would like to restart that specific node (an allocation, in Nomad parlance), or be able to perform a rolling restart of the entire cluster (a job, in Nomad parlance).

Does this sound useful?

@dadgar dadgar added stage/thinking theme/api HTTP API and SDK issues labels Jan 22, 2016
@dadgar
Contributor

dadgar commented Jan 22, 2016

It's not a bad idea! In the meantime, if you just want to restart the job, you can stop it and then run it again.

@mkabischev

I think it would be a good feature. Right now I can stop and then run the job, but it won't be graceful.

@gpaggi

gpaggi commented Apr 19, 2016

+1
Another use case: most of our services read their configuration either from static files or Consul, and when any of the properties change the services need to be rolling-restarted.
Stopping and starting the job would cause a service interruption, and a blue/green deployment for a configuration change is a bit of overkill.

@supernomad did you get a chance to look into it?

@jtuthehien

+1 for this feature

@c4milo
Contributor

c4milo commented Jun 14, 2016

This is much needed in order to effectively reload configurations without downtime. As mentioned above, blue/green doesn't really scale well when you have too many tasks, and it is somewhat unpredictable, since it depends on the specific app being deployed playing well with multiple versions of itself running at the same time.

@liclac

liclac commented Jul 14, 2016

I'd very much like to see this, for a slightly different use case:

I have something running as a system job (in this case, a wrapper script that essentially does docker pull ... && docker run ...; it needs to mount a host directory to work, which is a workaround for #150). To roll out an update, I currently need to change a dummy environment variable, or Nomad won't know anything changed.

@mohitarora

+1

@dennybaa

Why not? Guys, please add it; it should be trivial.

@jippi
Contributor

jippi commented Sep 27, 2016

👍 on this feature as well :)

@xyzjace

xyzjace commented Jan 16, 2017

👍 For us, too.

@ashald

ashald commented Jan 26, 2017

We would be happy to see this feature as well. Sometimes... services just need a manual restart. :( Would be nice if it was possible to restart individual tasks or task groups.

@rokka-n

rokka-n commented Jan 26, 2017

Having a rolling "restart" option is a very valid use case for tasks/jobs.

@jippi
Contributor

jippi commented Jan 26, 2017

What I've done as a hack is to have a key_or_default inline template{} stanza in the task stanza for each of these keys, simply writing them to some random temp file:

  • apps/${NOMAD_JOB_NAME}
  • apps/${NOMAD_JOB_NAME}/${NOMAD_TASK_NAME}
  • apps/${NOMAD_JOB_NAME}/${NOMAD_TASK_NAME}/${NOMAD_ALLOC_INDEX}
  • apps/${NOMAD_ALLOC_NAME}

Each template gets change_mode = restart or signal with the appropriate change_signal value, so I can do a manual rolling restart of any Nomad task by simply changing or creating one of those Consul keys programmatically, at my own pace, for a controlled restart too :)

Writing to Consul KV apps/${NOMAD_JOB_NAME} restarts all tasks in the job.
Writing to Consul KV apps/${NOMAD_JOB_NAME}/${NOMAD_TASK_NAME} restarts all instances of that task within the job.
Writing to Consul KV apps/${NOMAD_JOB_NAME}/${NOMAD_TASK_NAME}/${NOMAD_ALLOC_INDEX} restarts one specific task index within the job.
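
For example, one of those trigger templates might look roughly like this sketch (the task name, destination path, and key are placeholder assumptions, not exact config):

    task "app" {
      # Dummy template: its rendered output changes whenever the Consul key
      # changes, which fires change_mode and restarts this task.
      template {
        data        = "{{ keyOrDefault \"apps/myjob/mytask\" \"\" }}"
        destination = "local/restart-trigger"
        change_mode = "restart"
      }
    }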

@ashald

ashald commented Jan 26, 2017

@jippi that's super smart! Thanks, I guess I'll use that for the time being. :)

But that level of control is something that would be great to see in Nomad's native API.

P.S.: That reminds me of my hack/workaround for securing any resource behind Nginx (e.g., the Nomad API) using Consul ACL tokens with auth_request against some read-only API endpoints. :D

@pznamensky

Would be useful for us too.

@dansteen

dansteen commented Sep 6, 2017

This would also be useful for the new deployment stuff. The ability to re-trigger a deployment would be great.

@JewelPengin

JewelPengin commented Sep 6, 2017

Throwing in my +1, but also my non-Consul-based brute-force way:

export NOMAD_ADDR=http://[server-ip]:[admin-port]

curl $NOMAD_ADDR/v1/job/:jobId | jq '.TaskGroups[0].Count = 0 | {"Job": .}' | curl -X POST -d @- $NOMAD_ADDR/v1/job/:jobId

sleep 5

curl $NOMAD_ADDR/v1/job/:jobId | jq '.TaskGroups[0].Count = 1 | {"Job": .}' | curl -X POST -d @- $NOMAD_ADDR/v1/job/:jobId

It requires the jq binary to be installed (which I would highly recommend anyway), but it will first fetch the job, set the task group count to 0, POST it back to update the job, and then do it all over again with the count back at 1 (or whatever number is needed).

Again, kinda brute force and not as elegant as @jippi's, but it works if I need to get something done quickly.

@danielwpz

Really useful feature! Please do it :D

@sullivanchan

I have done some verification following @jippi's suggestion, with data = "{{ key apps/app1/app1/${NOMAD_ALLOC_INDEX} }}" in the template stanza, but the job start is always pending. It seems env variables are only available via {{ env "ENV_VAR" }} (https://www.nomadproject.io/docs/job-specification/template.html#inline-template). I want to know how to interpolate an env variable into the key string; does anybody have the same question?

@mildred
Contributor

mildred commented Sep 19, 2017

This is a standard Go template:

          {{keyOrDefault (printf "apps/app1/app1/%s" (env "NOMAD_ALLOC_INDEX")) ""}}

@mildred
Contributor

mildred commented Sep 19, 2017

I suggest you use keyOrDefault instead of just key; plain key will prevent your service from starting unless the key exists in Consul.
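
Putting the two together, a per-allocation trigger template block might look something like this sketch (the key path and destination are illustrative):

    template {
      # printf builds the key path from this allocation's index at render
      # time, e.g. apps/app1/app1/0. keyOrDefault renders "" until the key
      # exists, so the task isn't blocked from starting.
      data        = "{{ keyOrDefault (printf \"apps/app1/app1/%s\" (env \"NOMAD_ALLOC_INDEX\")) \"\" }}"
      destination = "local/restart-trigger"
      change_mode = "restart"
    }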

@thevilledev
Contributor

As a workaround I've been using Nomad's meta stanza to control restarts. Meta keys are populated as environment variables in tasks, so whenever a meta block changes, all related tasks (or task groups) are restarted. Meta blocks can be defined at the top level of the job, per task group, or per task.

For example, to restart all tasks in all task groups you could run this:

$ nomad inspect some-job | \
jq --arg d "$(date)" '.Job.Meta={restarted_at: $d}' | \
curl -X POST -d @- nomad.service.consul:4646/v1/jobs

This respects the update stanza as well.
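
For reference, a sketch of where the meta block can live (job, group, or task level; the job name, task, image, and values are illustrative assumptions):

    job "some-job" {
      # Job-level meta: changing any value here restarts all tasks in the job.
      meta {
        restarted_at = "2017-09-14T12:00:00Z"
      }

      group "web" {
        # Group-level meta: changing this restarts only this group's tasks.
        meta {
          restarted_at = "2017-09-14T12:00:00Z"
        }

        task "app" {
          driver = "docker"
          config {
            image = "nginx:stable"
          }

          # Task-level meta: changing this restarts only this task.
          meta {
            restarted_at = "2017-09-14T12:00:00Z"
          }
        }
      }
    }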

@maihde
Contributor

maihde commented Mar 2, 2018

I have made a first pass at implementing this; you can find my changes here.

Basically, I've added a -restart flag to nomad run. For example:

nomad run -restart myjob.nomad

When the -restart flag is applied, it triggers an update, the same as if you had changed the meta block, so you get the benefits of canaries and rolling restarts without having to actually change the job file.

If there is agreement that this implementation is going down the right path, I will go to the trouble of writing tests and making sure it works for the system scheduler, parameterized jobs, etc.

@jovandeginste

Why not implement this without the need for a plan? Basically, nomad restart myjobname (which should use the current plan)

As a sysop, I sometimes need to force a restart of a job, but I don't have the plan (and don't want to go through nomad inspect | parse)

@rkettelerij
Contributor

Agreeing with @jovandeginste here. A restart shouldn't need a job definition in my opinion, since the job definition is already known inside Nomad.

@jovandeginste

I do see the use case of re-submitting an existing job with a plan that may or may not have changed, while always forcing a restart (of the whole job) on submission. So both are interesting options.

@Oloremo
Contributor

Oloremo commented Dec 1, 2020

looking forward to this as well

@datadexer

same here!

@OmarQunsul

OmarQunsul commented Jan 7, 2021

I am also surprised this feature doesn't exist. In Docker Swarm, for example, there is docker service update --force SERVICE_NAME.
I was expecting something under the job command, like nomad job restart, that restarts each alloc without downtime for the whole job.

@tgross tgross added this to Needs Roadmapping in Nomad - Community Issues Triage Feb 12, 2021
@jpasichnyk

+1 for this feature. We just moved to Nomad 1.x and are trying to move to the built-in Nomad UI (from HashiUI, https://github.com/jippi/hashi-ui), and having the ability to restart a job from there would be great. Sometimes we have application instances that go unhealthy from a system perspective but are still running fine in Docker. In that case we don't want to force-restart them, since depending on the reason they are unhealthy they may not be able to restart safely. Restarting the whole job via a rolling restart is a great way to fix this state, but for us there is no way to do it other than building a new container version and promoting a new job over the existing one (even if the bits being deployed are identical). HashiUI can restart via a rolling restart or a stop/start; the Nomad UI and CLI should support this as well.

@tgross tgross removed this from Needs Roadmapping in Nomad - Community Issues Triage Mar 3, 2021
@thatsk

thatsk commented Apr 27, 2021

Is this added to the Nomad UI, or still in progress?

@stupidlamo

+1 to this feature; we really need to shut down hashi-ui and use only native Nomad, but can't due to the unavailability of rolling restarts.

@kunalsingthakur

Yeah @tgross, there are situations where a container depends on Consul key-values: if we update a key value in Consul and then restart the service, it picks up the new values in the container. So we really think this needs to be added to the Nomad UI so we can get rid of HashiUI; we don't want to maintain two UIs for Nomad.

@kunalsingthakur

Can we assume this is on the roadmap?

@dg-eparizzi

+1 to this feature

@josegonzalez
Contributor

The way hashi-ui implements this is by injecting a label into the job, which messes with nomad job plan: the same job will show a diff because the local copy won't have the injected label.

@thatsk

thatsk commented Sep 1, 2021 via email

@sbrl

sbrl commented Sep 3, 2021

The way hashi-ui implements this is by injecting a label into the job, which messes with nomad job plan: the same job will show a diff because the local copy won't have the injected label.

An easy CLI subcommand / HTTP API call for that function would be very handy.

@victusfate

I ended up getting what I wanted (a rolling restart of an existing application) using the following Python snippet and the Nomad HTTP API:

    import os
    import time

    import requests

    # NOMAD_URL (e.g. "http://localhost:4646") and job_id are assumed to be
    # defined elsewhere.
    get_job_url = NOMAD_URL + os.path.join('/v1/job', job_id)
    get_job_response = requests.get(get_job_url)
    job = get_job_response.json()

    # Touch a Meta key so Nomad sees a new job version and rolls the update.
    if 'Meta' not in job or job['Meta'] is None:
        job['Meta'] = {}
    job['Meta']['Restart'] = str(time.time())
    payload = {'Job': job, 'PreserveCounts': True}

    # Now post it back.
    post_url = NOMAD_URL + '/v1/jobs'
    post_job_response = requests.post(post_url, json=payload)
    print('restart job response', post_job_response.json())

@maxadamo

maxadamo commented Feb 7, 2022

Unless I'm overlooking a possible drawback, the command suggested by @mxab looks good to me.
You can use either variation of the command and add it to your shell aliases:

nomad job status <job-name> | awk '{if (/run(.*)running/) {system("nomad alloc restart " $1)}}'
nomad job status <job-name> | awk '/run(.*)running/{print $1}' | xargs -t -n 1 nomad alloc restart

@Laboltus

Unless I'm overlooking a possible drawback, the command suggested by @mxab looks good to me. You can use either variation of the command and add it to your shell aliases:

nomad job status <job-name> | awk '{if (/run(.*)running/) {system("nomad alloc restart " $1)}}'
nomad job status <job-name> | awk '/run(.*)running/{print $1}' | xargs -t -n 1 nomad alloc restart

As I understand it, nomad alloc restart doesn't re-download artifacts or Docker images? I need to restart a job so that it picks up the current Docker image.

@tgross
Member

tgross commented Aug 23, 2022

Doing some issue cleanup and realizing there's a whole lot of different feature requests being discussed in this issue over the years, many of which landed long ago. I'm going to re-title this issue to narrow the scope to the remaining request.

@tgross tgross changed the title Add ability to restart running tasks/jobs Add ability to restart all running tasks/allocs of a job Aug 23, 2022
@tgross tgross added this to Needs Triage in Nomad - Community Issues Triage via automation Aug 23, 2022
@tgross tgross moved this from Needs Triage to Needs Roadmapping in Nomad - Community Issues Triage Aug 23, 2022
Nomad - Community Issues Triage automation moved this from Needs Roadmapping to Done Mar 23, 2023
@EugenKon

With the command above I cannot restart a task that has failed:

$ nomad job restart -task nginx-task portal
==> 2024-02-23T11:13:56-05:00: Restarting 1 allocation
    2024-02-23T11:13:56-05:00: Restarting task "nginx-task" in allocation "27caddf2" for group "services"
==> 2024-02-23T11:13:56-05:00: Job restart finished with errors

1 error occurred while restarting job:
* Error restarting allocation "27caddf2": Failed to restart task "nginx-task": Unexpected response code: 500 (Task not running)


$ nomad alloc restart 27caddf2
Failed to restart allocation:

Unexpected response code: 500 (restart of an alloc that should not run)

It's not clear how to restart a failed task.

@jippi
Contributor

jippi commented Feb 23, 2024

Please open a new issue for that. This issue is many years old and closed :)
