Nomad prestart task breaks "alloc restart" #9841

Closed
johnzhanghua opened this issue Jan 18, 2021 · 7 comments · Fixed by #14127
Labels: stage/accepted (Confirmed, and intend to work on. No timeline commitment though.), theme/task lifecycle, type/bug

Comments

@johnzhanghua

Nomad version

Nomad v0.12.0 (8f7fbc8)

Operating system and Environment details

CentOS 7.5 VM on VirtualBox 6.1

Issue

nomad alloc restart does not work when the group has a prestart lifecycle task.

Expected behaviour: alloc restart restarts the prestart task and then the main task.

Reproduction steps

  • Run the Nomad job with the job file below, then check its status
nomad job status test
ID            = test
Name          = test
Submit Date   = 2021-01-18T11:47:41Z
Type          = service
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
test        0       0         1        0       0         0

Latest Deployment
ID          = ff4a14b1
Status      = running
Description = Deployment is running

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
test        1        1       0        0          2021-01-18T11:57:41Z

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
df20b14a  1d358dc0  test        0        run      running  15s ago  7s ago

  • Run nomad alloc restart
nomad alloc restart df20b14a
Failed to restart allocation:

Unexpected response code: 500 (rpc error: 2 errors occurred:
	* Task not running
	* Task not running

)

Job file (if appropriate)

job "test" {
  datacenters = ["dc1"]
  type = "service"

  group "test" {
    restart {
      interval = "6m"
      attempts = 10
      delay    = "10s"
      mode     = "delay"
    }

    # add prestart task
    task "test-pre" {
      driver = "docker"
      lifecycle {
        hook = "prestart"
        sidecar = false
      }

      config {
        image = "alpine:3.8"
        command = "sh"

        args = ["-c", "echo test > /alloc/test_file"]
      }
    }

    task "test" {
      driver = "docker"

      config {
        image = "alpine:3.8"
        command = "sh"

        args = ["-c", "if [ ! -s /alloc/test_file ]; then sleep 5; exit 1; else while sleep 3600; do :; done; fi"]
      }
    }
  }
}
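
The main task in this job only stays up if the prestart task has already written a non-empty /alloc/test_file. A minimal sketch of that check in isolation, runnable without Nomad, follows; the function name main_task_check and the temporary file are illustrative assumptions and are not part of the job file:

```shell
#!/bin/sh
# main_task_check mirrors the condition in the "test" task's args:
# it succeeds only if the given file exists and is non-empty.
# (Hypothetical helper; it prints what the main task would do.)
main_task_check() {
  if [ ! -s "$1" ]; then
    echo "missing: main task would sleep 5 and exit 1"
    return 1
  fi
  echo "present: main task would keep running"
}

tmp_file="$(mktemp)"                  # stands in for /alloc/test_file
main_task_check "$tmp_file" || true   # empty file: prestart has not run
echo test > "$tmp_file"               # what the prestart task writes
main_task_check "$tmp_file"           # non-empty file: main task stays up
rm -f "$tmp_file"
```

This is why the job is a useful reproduction: the main task depends on output that only the prestart task produces.
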
@ianmdrennan

Expected behaviour: alloc restart restarts the prestart task and then the main task.

Hmm, just to play devil's advocate a bit here: this would cause a lot of unnecessary computation in our environments. For example, if we have a container template change and the container is restarted, we definitely don't want to run our prestarts again. Those, in our case, are for full start/stop scenarios and for when an allocation moves to a different node.

@tgross
Member

tgross commented Jan 19, 2021

I've verified this behavior on the current HEAD, with a few other surprises.

If we run the job that @johnzhanghua provided, we get an error when we ask for a restart:

$ nomad alloc restart 4bf
Failed to restart allocation:

Unexpected response code: 500 (1 error occurred:
        * Task not running

)

But the interesting thing is that it looks like the main task does restart. For a prestart task with sidecar=false, this is arguably correct behavior as @ianmdrennan says. But it's not documented what to expect, which we should fix.

$ nomad alloc status 4bf
...

Task "test-pre" (prestart) is "dead"
Task Resources
CPU        Memory       Disk     Addresses
0/100 MHz  0 B/300 MiB  300 MiB

Task Events:
Started At     = 2021-01-19T13:26:02Z
Finished At    = 2021-01-19T13:26:02Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type        Description
2021-01-19T08:26:02-05:00  Terminated  Exit Code: 0
2021-01-19T08:26:02-05:00  Started     Task started by client
2021-01-19T08:25:58-05:00  Driver      Downloading image
2021-01-19T08:25:58-05:00  Task Setup  Building Task Directory
2021-01-19T08:25:58-05:00  Received    Task received by client

Task "test" is "running"
Task Resources
CPU        Memory           Disk     Addresses
0/100 MHz  164 KiB/300 MiB  300 MiB

Task Events:
Started At     = 2021-01-19T13:26:16Z
Finished At    = N/A
Total Restarts = 1
Last Restart   = 2021-01-19T13:26:16Z

Recent Events:
Time                       Type              Description
2021-01-19T08:26:16-05:00  Started           Task started by client
2021-01-19T08:26:16-05:00  Restarting        Task restarting in 0s
2021-01-19T08:26:16-05:00  Terminated        Exit Code: 137, Exit Message: "Docker container exited with non-zero exit code: 137"
2021-01-19T08:26:10-05:00  Restart Signaled  User requested restart
2021-01-19T08:26:02-05:00  Started           Task started by client
2021-01-19T08:26:02-05:00  Task Setup        Building Task Directory
2021-01-19T08:25:58-05:00  Received          Task received by client

@tgross
Member

tgross commented Feb 2, 2021

Hmm, just to play devil's advocate a bit here: this would cause a lot of unnecessary computation in our environments. For example, if we have a container template change and the container is restarted, we definitely don't want to run our prestarts again. Those, in our case, are for full start/stop scenarios and for when an allocation moves to a different node.

Hey @ianmdrennan, I just had a chat with my colleague @jazzyfresh on this. There's a subtle difference in behavior we want to capture here between:

  • someone has run the nomad alloc restart :alloc_id command
  • the main task has restarted for reasons outside of Nomad's control

If you run nomad alloc restart :alloc_id, you're saying "restart this allocation", which includes the prestart tasks. If you wanted to use the same command to just restart the main task, it'd be nomad alloc restart :alloc_id :main_task_name. Whereas if the main task fails and is restarted, we definitely don't want to restart the prestart task.
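
The two invocations described above can be sketched as a small shell helper. This is only an illustration: restart_cmd is a hypothetical function, not part of Nomad, and it prints the command rather than running it, so it can be tried without a cluster. The allocation ID and task name are the ones from the job in this issue.

```shell
#!/bin/sh
# restart_cmd is a hypothetical helper (not part of Nomad): it prints the
# `nomad alloc restart` invocation for an allocation, optionally for one task.
restart_cmd() {
  alloc_id="$1"
  task_name="${2:-}"
  if [ -n "$task_name" ]; then
    # A task name was given: the request is limited to that task.
    echo "nomad alloc restart $alloc_id $task_name"
  else
    # No task name: the request is addressed to the allocation itself.
    echo "nomad alloc restart $alloc_id"
  fi
}

restart_cmd df20b14a        # restart request for the allocation
restart_cmd df20b14a test   # restart request for the main task only
```
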

We're going to be working up some documentation improvements alongside the implementation for this bug fix, which should outline the matrix of behaviors to expect.

@mr-karan
Contributor

If you run nomad alloc restart :alloc_id, you're saying "restart this allocation", which includes the prestart tasks.

In my case if sidecar=false it doesn't restart the task. If that is indeed the correct behavior, is it possible to have a force restart in such cases? Or just a general ability to restart a dead task?

@thatsk

thatsk commented Feb 17, 2022

same issue with me

@lgfa29
Contributor

lgfa29 commented Aug 29, 2022

Hi all 👋

The work to enable this has been completed and released in Nomad v1.3.4. You can use the new -all-tasks flag when restarting an alloc to re-run lifecycle tasks.
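
A minimal sketch of how the flag might be used with the allocation from the original report. The allocation ID will differ in your cluster, and DRY_RUN and run are hypothetical helper names introduced here for illustration; by default the commands are only printed, not executed.

```shell
#!/bin/sh
# Hypothetical wrapper: prints each command unless DRY_RUN=0 is set,
# in which case the command is actually executed.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Allocation ID taken from the original report; override via the environment.
ALLOC_ID="${ALLOC_ID:-df20b14a}"

# Restart every task in the allocation, including the prestart task that
# already ran (needs Nomad v1.3.4 or later, per the comment above).
run nomad alloc restart -all-tasks "$ALLOC_ID"

# Inspect task states and recent events afterwards.
run nomad alloc status "$ALLOC_ID"
```

Set DRY_RUN=0 to run the commands against a real cluster.
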

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 28, 2022

8 participants