Alloc restart does not restart one-shot lifecycle tasks #9464

Closed
cgbaker opened this issue Nov 29, 2020 · 6 comments · Fixed by #14127
Labels
hcc/cst Admin - internal stage/accepted Confirmed, and intend to work on. No timeline committment though. theme/task lifecycle type/bug

Comments

@cgbaker
Contributor

cgbaker commented Nov 29, 2020

Nomad version

Nomad v1.0.0-beta3 (fcb32ef7ba037b2b83e7a0fd1c53c7410a2990db)

Issue

The alloc restart command optimistically attempts to restart all tasks in the allocation; it collects any task-level errors and returns them:
https://github.com/hashicorp/nomad/blob/v1.0.0-beta3/client/allocrunner/alloc_runner.go#L1173-L1178

For an allocation with a dead poststart task, the main task will be restarted, but the poststart task will fail to restart with the error "Task not running" (because it's dead).
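The behavior described above can be modeled with a small sketch. This is a toy model only: the `task` type and the `restart` and `restartAll` functions are hypothetical names for illustration and are not Nomad's actual implementation, which lives in the linked `alloc_runner.go`.

```go
package main

import (
	"errors"
	"fmt"
)

// task is a hypothetical stand-in for a task runner; it is not Nomad's type.
type task struct {
	name    string
	running bool
}

// errTaskNotRunning mirrors the message quoted in this issue.
var errTaskNotRunning = errors.New("Task not running")

// restart fails for a task that has already exited, which is the state a
// one-shot poststart task is in once its command has completed.
func (t *task) restart() error {
	if !t.running {
		return errTaskNotRunning
	}
	return nil
}

// restartAll optimistically tries every task and collects the failures
// instead of stopping at the first one.
func restartAll(tasks []*task) []error {
	var errs []error
	for _, t := range tasks {
		if err := t.restart(); err != nil {
			errs = append(errs, err)
		}
	}
	return errs
}

func main() {
	tasks := []*task{
		{name: "main", running: true},
		{name: "poststart", running: false}, // already exited
	}
	for _, err := range restartAll(tasks) {
		fmt.Println(err)
	}
}
```

In this model the main task restarts cleanly, while the dead poststart task contributes an error that says nothing about which task produced it.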

There are a few changes we could make here:

  • There is an argument that poststart tasks should be re-run when the main tasks are restarted.
  • Otherwise, I think there's an argument that RestartAll should not return an error; for an alloc-level restart, we should not attempt to restart dead poststart tasks.
  • Even if we keep the behavior the same, the error message could be improved to indicate which task failed to restart. Currently, if there are multiple dead poststart tasks, each will return an error on restart, resulting in an increasingly unhelpful error message:
$ nomad alloc restart bab9
Failed to restart allocation:

Unexpected response code: 500 (2 errors occurred:
	* Task not running
	* Task not running

)

Reproduction steps

  1. Run the job below
  2. After the poststart task has stopped, run nomad alloc restart <alloc-id>
  3. Note the error message

Job file (if appropriate)

job "repro7875" {
  type = "service"
  datacenters = ["dc1"]
  group "repro" {
    task "main" {
      driver = "exec"
      config {
        command = "sleep"
        args = ["3600"]
      }  
    }
    task "poststart" {
      driver = "exec"
      config {
        command = "env"
      }
      lifecycle {
        hook = "poststart"
      }
    }
  }
}

@cgbaker cgbaker added this to the 1.0 milestone Nov 29, 2020
@tgross tgross removed this from the 1.0 milestone Dec 4, 2020
@tgross tgross added this to In Progress in Nomad - Community Issues Triage Feb 12, 2021
@tgross tgross removed this from In Progress in Nomad - Community Issues Triage Feb 12, 2021
@tgross tgross added this to the 1.0.4 milestone Feb 18, 2021
@tgross tgross added stage/accepted Confirmed, and intend to work on. No timeline committment though. type/bug and removed stage/needs-discussion labels Feb 22, 2021
@tgross tgross removed this from the 1.0.4 milestone Feb 24, 2021
@isabeldepapel isabeldepapel self-assigned this Jun 15, 2021
@tgross tgross changed the title Alloc restart does not restart post-start tasks Alloc restart does not restart one-shot lifecycle tasks Jul 7, 2021
@marcofiocco

I'm using Nomad 1.0.4 server and I got a similar issue. Nomad decided to restart the job on its own. The main task is fine, but the poststart task has not been run.

@marcofiocco

If it helps, it seems that the main task was restarted because of "Failed due to progress deadline".

@sriyer

sriyer commented Dec 10, 2021

We see the same problem even with Nomad 1.1.6. @cgbaker @tgross, is there a plan to get this fixed?

Of the options listed, "There is an argument that poststart tasks should be re-run when the main tasks are restarted." would be a good one to have.

@tgross
Member

tgross commented Dec 10, 2021

Hi @sriyer. We got started with the work in #10785 but it didn't get completed. It's on-deck on the roadmap but I can't give you an exact release window for it.

@dhung-hashicorp dhung-hashicorp added the hcc/cst Admin - internal label Dec 12, 2021
@tgross tgross removed their assignment May 19, 2022
@lgfa29
Contributor

lgfa29 commented Aug 29, 2022

Hi all 👋

The work to enable this has been completed and released in Nomad v1.3.4. You can use the new -all-tasks flag when restarting an alloc to re-run lifecycle tasks.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 28, 2022