Nomad prestart task breaks "alloc restart" #9841

johnzhanghua · 2021-01-18T12:51:37Z

Nomad version

Nomad v0.12.0 (8f7fbc8)

Operating system and Environment details

CentOs 7.5 VM env on Virtualbox 6.1

Issue

nomad alloc restart does not work when the job has a prestart lifecycle task.

Expected behaviour: alloc restart restarts the prestart task and then the main task.

Reproduction steps

Run the Nomad job using the job file below, then check its status:

nomad job status test
ID            = test
Name          = test
Submit Date   = 2021-01-18T11:47:41Z
Type          = service
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
test        0       0         1        0       0         0

Latest Deployment
ID          = ff4a14b1
Status      = running
Description = Deployment is running

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
test        1        1       0        0          2021-01-18T11:57:41Z

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
df20b14a  1d358dc0  test        0        run      running  15s ago  7s ago

Run nomad alloc restart

nomad alloc restart df20b14a
Failed to restart allocation:

Unexpected response code: 500 (rpc error: 2 errors occurred:
	* Task not running
	* Task not running

)

Job file (if appropriate)

job "test" {
  datacenters = ["dc1"]
  type = "service"

  group "test" {
    restart {
      interval = "6m"
      attempts = 10
      delay    = "10s"
      mode     = "delay"
    }

    # add prestart task
    task "test-pre" {
      driver = "docker"
      lifecycle {
        hook = "prestart"
        sidecar = false
      }

      config {
        image = "alpine:3.8"
        command = "sh"

        args = ["-c", "echo test > /alloc/test_file"]
      }
    }

    task "test" {
      driver = "docker"

      config {
        image = "alpine:3.8"
        command = "sh"

        args = ["-c", "if [ ! -s /alloc/test_file ]; then sleep 5; exit 1; else while sleep 3600; do :; done; fi"]
      }
    }
  }
}


ianmdrennan · 2021-01-18T18:18:15Z

> Expected behaviour, alloc restart the prestart task and then the main task.

Hmmm, just to play devil's advocate here: this would cause a lot of unnecessary computation in our environments. For example, if we have a container template change and the container is restarted, we definitely don't want to run our prestarts again. Those, in our case, are for full start/stop scenarios and for when an allocation moves to a different node.

tgross · 2021-01-19T13:34:15Z

I've verified this behavior on the current HEAD, with a few other surprises.

If we run the job that @johnzhanghua provided, we get an error when we ask for a restart:

$ nomad alloc restart 4bf
Failed to restart allocation:

Unexpected response code: 500 (1 error occurred:
        * Task not running

)

But the interesting thing is that it looks like the main task does restart. For a prestart task with sidecar=false, this is arguably correct behavior as @ianmdrennan says. But it's not documented what to expect, which we should fix.

$ nomad alloc status 4bf
...

Task "test-pre" (prestart) is "dead"
Task Resources
CPU        Memory       Disk     Addresses
0/100 MHz  0 B/300 MiB  300 MiB

Task Events:
Started At     = 2021-01-19T13:26:02Z
Finished At    = 2021-01-19T13:26:02Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type        Description
2021-01-19T08:26:02-05:00  Terminated  Exit Code: 0
2021-01-19T08:26:02-05:00  Started     Task started by client
2021-01-19T08:25:58-05:00  Driver      Downloading image
2021-01-19T08:25:58-05:00  Task Setup  Building Task Directory
2021-01-19T08:25:58-05:00  Received    Task received by client

Task "test" is "running"
Task Resources
CPU        Memory           Disk     Addresses
0/100 MHz  164 KiB/300 MiB  300 MiB

Task Events:
Started At     = 2021-01-19T13:26:16Z
Finished At    = N/A
Total Restarts = 1
Last Restart   = 2021-01-19T13:26:16Z

Recent Events:
Time                       Type              Description
2021-01-19T08:26:16-05:00  Started           Task started by client
2021-01-19T08:26:16-05:00  Restarting        Task restarting in 0s
2021-01-19T08:26:16-05:00  Terminated        Exit Code: 137, Exit Message: "Docker container exited with non-zero exit code: 137"
2021-01-19T08:26:10-05:00  Restart Signaled  User requested restart
2021-01-19T08:26:02-05:00  Started           Task started by client
2021-01-19T08:26:02-05:00  Task Setup        Building Task Directory
2021-01-19T08:25:58-05:00  Received          Task received by client
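
The mismatch between the error and the actual task states is easier to see if only the per-task state lines are pulled out of the status output. The following is a minimal sketch, assuming the header format shown above (Task "name" ... is "state"); task_states is a hypothetical helper for illustration, not part of the Nomad CLI.

```shell
#!/bin/sh
# task_states is a hypothetical helper, not part of the Nomad CLI.
# It reads `nomad alloc status` text on stdin and prints "<task> <state>"
# for every header line of the form: Task "name" ... is "state"
task_states() {
  sed -n 's/^Task "\([^"]*\)".* is "\([^"]*\)"$/\1 \2/p'
}

# Sample input copied from the status output above.
printf '%s\n' \
  'Task "test-pre" (prestart) is "dead"' \
  'Total Restarts = 0' \
  'Task "test" is "running"' \
  'Total Restarts = 1' | task_states
```

Against a live cluster the same filter could be fed with nomad alloc status 4bf | task_states, assuming the output format matches the one in this comment.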

tgross · 2021-02-02T19:39:21Z

> hmmm, just to play a bit of an alternative advocate here - this would cause a lot of unnecessary computation in our environments. For example, if we have a container template change and the container is restarted, we definitely don't want to run our prestarts again - those, in our case, are for full start/stop scenarios and if an allocation moves to a different node.

Hey @ianmdrennan, I just had a chat with my colleague @jazzyfresh on this. There's a subtle difference in behavior we want to capture here between two cases:

someone has run the nomad alloc restart :alloc_id command
the main task has restarted for reasons outside of Nomad's control

If you run nomad alloc restart :alloc_id, you're saying "restart this allocation", which includes the prestart tasks. If you wanted to use the same command to just restart the main task, it'd be nomad alloc restart :alloc_id :main_task_name. Whereas if the main task fails and is restarted, we definitely don't want to restart the prestart task.
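
The two invocations described in this comment can be sketched as below. This is a dry-run illustration of the intended semantics at the time of the comment, not a statement of what any particular release does; restart_cmd is a hypothetical helper that only prints the command, and the allocation ID and task name are taken from the original report.

```shell
#!/bin/sh
# Dry-run sketch: prints the restart command instead of executing it.
# restart_cmd is a hypothetical helper, not part of the Nomad CLI.
restart_cmd() {
  alloc_id="$1"
  task_name="${2:-}"
  if [ -n "$task_name" ]; then
    # With a task name, only that task is asked to restart.
    printf 'nomad alloc restart %s %s\n' "$alloc_id" "$task_name"
  else
    # Without a task name, the request targets the whole allocation.
    printf 'nomad alloc restart %s\n' "$alloc_id"
  fi
}

restart_cmd df20b14a        # "restart this allocation"
restart_cmd df20b14a test   # restart only the main task "test"
```

The third case in the comment, a main task that fails and is restarted by the restart policy, involves no command at all, which is why it is not represented here.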

We're going to be working up some documentation improvements alongside the implementation for this bug fix, which should outline the matrix of behaviors to expect.

mr-karan · 2021-10-12T07:52:46Z

> If you run nomad alloc restart :alloc_id, you're saying "restart this allocation", which includes the prestart tasks.

In my case if sidecar=false it doesn't restart the task. If that is indeed the correct behavior, is it possible to have a force restart in such cases? Or just a general ability to restart a dead task?

thatsk · 2022-02-17T11:59:41Z

same issue with me


lgfa29 · 2022-08-29T21:55:26Z

Hi all 👋

The work to enable this has been completed and released in Nomad v1.3.4. You can use the new -all-tasks flag when restarting an alloc to re-run lifecycle tasks.
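
As a minimal sketch of the flag mentioned above, assuming Nomad v1.3.4 or later and using the allocation ID from the original report: restart_all_cmd is a hypothetical dry-run helper that prints the command rather than running it.

```shell
#!/bin/sh
# restart_all_cmd is a hypothetical dry-run helper, not part of the Nomad CLI.
# It prints the command rather than running it.
# The -all-tasks flag is available from Nomad v1.3.4, per the comment above.
restart_all_cmd() {
  printf 'nomad alloc restart -all-tasks %s\n' "$1"
}

restart_all_cmd df20b14a   # allocation ID from the original report
```

Running the printed command should re-run the prestart task "test-pre" before the main task "test" from the job file in this issue.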

github-actions · 2022-12-28T02:13:37Z

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

tgross added the theme/task lifecycle, type/bug, and stage/accepted (Confirmed, and intend to work on. No timeline commitment though.) labels Jan 19, 2021

tgross mentioned this issue Jan 19, 2021

Nomad lifecyle prestart/non-sidecar task not restarting(before main task) when reboot node #9840

Open

jazzyfresh self-assigned this Feb 2, 2021

tgross added this to Needs Triage in Nomad - Community Issues Triage via automation Feb 12, 2021

tgross moved this from Needs Triage to In Progress in Nomad - Community Issues Triage Feb 12, 2021

Oloremo mentioned this issue Apr 16, 2021

[feature request] Force(fast) job restart #10391

Closed

drewbailey unassigned jazzyfresh May 13, 2021

drewbailey moved this from In Progress to Needs Roadmapping in Nomad - Community Issues Triage May 13, 2021

drewbailey mentioned this issue May 13, 2021

Restarting allocations does not seems to respect lifecycle and shudown_delay constraints #10578

Open

tgross removed this from Needs Roadmapping in Nomad - Community Issues Triage May 13, 2021

tgross mentioned this issue Jun 18, 2021

lifecycle: unit test for lifecycle task behavior on restarts #10785

Closed

mr-karan mentioned this issue Oct 12, 2021

Ability to always download artifacts #11297

Open

jazzyfresh self-assigned this Nov 29, 2021

jazzyfresh added a commit that referenced this issue May 10, 2022

lifecycle: block prestart taskrunners from exiting until allocation r…

2305064

…eady to exit #9841

jazzyfresh pushed a commit that referenced this issue May 10, 2022

lifecycle: block prestart taskrunners from exiting until allocation r…

57d05e4

…eady to exit #9841

mmcquillan added this to the 1.3.x milestone May 17, 2022

lgfa29 mentioned this issue Aug 18, 2022

Task lifecycle restart #14127

Merged

lgfa29 closed this as completed in #14127 Aug 24, 2022

hc-github-team-nomad-core mentioned this issue Aug 24, 2022

Backport of Task lifecycle restart into release/1.3.x #14312

Merged

tgross modified the milestones: 1.3.x, 1.3.4 Aug 26, 2022

github-actions bot locked as resolved and limited conversation to collaborators Dec 28, 2022
