
Task lifecycle restart #14127

Merged
merged 14 commits into main from task-lifecycle-restart
Aug 24, 2022

Conversation

lgfa29
Contributor

@lgfa29 lgfa29 commented Aug 16, 2022

Following up on the work done in #14009, this PR implements a new restart mode that allows all tasks of an allocation to be restarted, even those that have already run, such as non-sidecar prestart or poststop tasks.

It also solves some related issues, such as the restart command failing with Task not running errors due to dead tasks in the allocation. These errors are now ignored when restarting the allocation (restarting a dead task with -task will still result in this error).

Closes #9464
Closes #9688
Closes #9841


Note to reviewers: the internal RFC proposed a new task state (complete) to differentiate a task that is really dead (and will never run again) from a task that finished running but is waiting for a restart.

But during implementation I noticed that the dead state is checked in quite a few places, and that it didn't really matter whether the task was dead or complete, so I was able to implement this functionality without the need for the extra state.

@lgfa29 lgfa29 added the backport/1.3.x backport to 1.3.x release line label Aug 18, 2022
@@ -127,6 +127,16 @@ func (a *Allocations) Restart(alloc *Allocation, taskName string, q *QueryOption
return err
}

func (a *Allocations) RestartAllTasks(alloc *Allocation, q *QueryOptions) error {
Contributor Author

I created a new method here to avoid a breaking change to the Restart method above in the api package.
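For illustration, the backward-compatibility concern can be sketched with simplified stand-in types (Allocations, Allocation, and QueryOptions here are not the real api package definitions): adding a method is additive, while changing Restart's signature would break existing callers.

```go
package main

import "fmt"

// Simplified stand-ins for the api package types; the real
// definitions live in Nomad's api package.
type (
	QueryOptions struct{}
	Allocation   struct{ ID string }
	Allocations  struct{}
)

// Restart keeps its original signature, so existing callers of the
// api package continue to compile unchanged.
func (a *Allocations) Restart(alloc *Allocation, taskName string, q *QueryOptions) error {
	fmt.Printf("restart alloc=%s task=%q\n", alloc.ID, taskName)
	return nil
}

// RestartAllTasks is the additive new entry point; adding a method is
// backward compatible, while adding a parameter to Restart is not.
func (a *Allocations) RestartAllTasks(alloc *Allocation, q *QueryOptions) error {
	fmt.Printf("restart all tasks of alloc=%s\n", alloc.ID)
	return nil
}

func main() {
	a := &Allocations{}
	alloc := &Allocation{ID: "alloc-1"}
	_ = a.Restart(alloc, "web", nil)
	_ = a.RestartAllTasks(alloc, nil)
}
```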

The first implementation of the task runner restore process relied on
server data (`tr.Alloc().TerminalStatus()`) which may not be available
to the client at the time of restore.

It also took the incorrect code path. When restoring a dead task, the
driver handle always needs to be cleared cleanly using `clearDriverHandle`,
otherwise, after exiting the MAIN loop, the task may be killed by
`tr.handleKill`.

The fix is to store the state of the Run() loop in the task runner's local
client state: if the task runner ever exits this loop cleanly (not due to
a shutdown), it will never be able to run again. So if the Run() loop
starts with this local state flag set, it must exit early.

This local state flag is also checked on task restart requests. If
the task is "dead" and its Run() loop is not active, it will never be
able to run again.
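The fix described above can be sketched as follows; `localState`, `RunComplete`, and the method names are illustrative stand-ins for the task runner's persisted client state, not Nomad's exact identifiers.

```go
package main

import "fmt"

// localState models the task runner's locally persisted client state;
// RunComplete is an illustrative name for the flag described above.
type localState struct {
	RunComplete bool
}

type taskRunner struct {
	local localState
}

// markRunComplete records that the Run() loop exited cleanly (not due
// to an agent shutdown); in the real client this would be persisted to
// the local state store so it survives restores.
func (tr *taskRunner) markRunComplete() {
	tr.local.RunComplete = true
}

// canRun is checked on restore and on restart requests: once the flag
// is set, the task can never run again and Run() must exit early.
func (tr *taskRunner) canRun() bool {
	return !tr.local.RunComplete
}

func main() {
	tr := &taskRunner{}
	fmt.Println(tr.canRun()) // true
	tr.markRunComplete()
	fmt.Println(tr.canRun()) // false
}
```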
Member

@tgross tgross left a comment

LGTM!

client/allocrunner/taskrunner/task_runner_test.go (resolved)

Contributor

@DerekStrickland DerekStrickland left a comment

Amazing work! Left some questions/suggestions, but not blocking.

client/allocrunner/alloc_runner.go (resolved)
client/allocrunner/alloc_runner.go (resolved)
// The event type will be set to TaskRestartRunningSignal to comply with
// internal restart logic requirements.
func (ar *allocRunner) Restart(ctx context.Context, event *structs.TaskEvent, failure bool) error {
if event.Type != structs.TaskRestartRunningSignal {
Contributor

If the event.Type isn't TaskRestartRunningSignal, should we emit the incoming event to TaskState and create a new event rather than overwriting it? It looks like only the check watcher calls this method, and it uses the correct event.Type, but maybe it would be a little more future-proof to emit and create?

Contributor Author

Hum... not sure if I understood. Do you mean something like this?

func (ar *allocRunner) Restart(ctx context.Context, event *structs.TaskEvent, failure bool) error {
	ev := event.Copy()
	ev.Type = structs.TaskRestartRunningSignal
	return ar.restartTasks(ctx, ev, failure)
}

Member

I thought the original reason we were overwriting was a form of assertion? If it's not the right task restart signal, it's a bug?

Contributor Author

Yeah, my understanding is that, for the caller of the allocRunner.Restart* functions, the event type doesn't matter, so overwriting it was a way to "correct" the call if, for some reason, it was done using the wrong type.

There is no valid scenario where you would call, for example, RestartRunning with anything other than the TaskRestartRunningSignal event type, so instead of returning an error we correct the call.

The Restart method specifically is an interesting one. It is used to implement the WorkloadRestarter interface (defined here and here) so, in theory, it could be used to perform different types of restart, but I chose to keep it as-is (restart only tasks that are currently running).

Contributor

That's almost what I meant. I was suggesting we make the copy and emit both events. I was thinking that with this implementation you lose the info on which code path triggered the restart.

Contributor Author

A few things to note here:

  • Task events are used to surface information to users; they are not a code tracing mechanism.
  • Event types have a fairly restricted set of possible values, so they are not well suited to tracing code paths.
  • Events have other, more meaningful fields, like DisplayMessage, Details, RestartReason etc., and those will be kept from the input.
  • Task events are stored in state, and we only keep the last 10, so emitting (almost) duplicate events would take up a slot while providing little value to users.

That being said, my implementation does abuse the task event mechanism a bit. I tried to think of an alternative but haven't found one yet; I will take another look to see if I can come up with something else.

Contributor Author

e620cd1 removes all of this task event nonsense 😅

I also got rid of the new task event types to avoid any confusion on the developer side as to which type to use. I think it's easy enough for users to distinguish between them given their event descriptions:
(screenshot of task event descriptions)

client/allocrunner/alloc_runner_test.go (resolved)
client/allocrunner/alloc_runner_test.go (resolved)
@@ -277,7 +293,7 @@ func (c *Coordinator) isPrestartDone(states map[string]*structs.TaskState) bool
}

for _, task := range c.tasksByLifecycle[lifecycleStagePrestartEphemeral] {
- if states[task].State != structs.TaskStateDead || states[task].Failed {
+ if !states[task].Successful() {
Contributor

+1
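The refactored check in the diff above relies on a small helper on the task state. A minimal sketch of the equivalent logic, using a trimmed-down TaskState rather than the actual structs package: note that `!ts.Successful()` is exactly the old inline condition `State != TaskStateDead || Failed`.

```go
package main

import "fmt"

const TaskStateDead = "dead"

// TaskState is a trimmed-down stand-in for structs.TaskState.
type TaskState struct {
	State  string
	Failed bool
}

// Successful reports whether the task ran to completion without
// failing. Note that !ts.Successful() is exactly the old inline
// check: State != TaskStateDead || Failed.
func (ts *TaskState) Successful() bool {
	return ts.State == TaskStateDead && !ts.Failed
}

func main() {
	done := &TaskState{State: TaskStateDead}
	failed := &TaskState{State: TaskStateDead, Failed: true}
	running := &TaskState{State: "running"}
	fmt.Println(done.Successful(), failed.Successful(), running.Successful()) // true false false
}
```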

@@ -9,25 +9,63 @@ import (
// Restart a task. Returns immediately if no task is running. Blocks until
// existing task exits or passed-in context is canceled.
func (tr *TaskRunner) Restart(ctx context.Context, event *structs.TaskEvent, failure bool) error {
- tr.logger.Trace("Restart requested", "failure", failure)
+ tr.logger.Trace("Restart requested", "failure", failure, "event", event.GoString())
Contributor

👍

// all tasks in the alloc, otherwise the taskCoordinator will prevent
// it from running again, and if their Run method is still running.
if event.Type != structs.TaskRestartAllSignal || localState.RunComplete {
return ErrTaskNotRunning
Contributor

Would a distinct error type that indicates this condition be helpful upstream? If not, please ignore 😃

Contributor Author

Hum... I think the error is the same: the task was asked to restart but it wasn't running.

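The guard in the diff above can be sketched in isolation; the event type names and the runComplete parameter below are illustrative stand-ins for the structs package constants and the local state flag, not exact Nomad identifiers.

```go
package main

import (
	"errors"
	"fmt"
)

var ErrTaskNotRunning = errors.New("Task not running")

// Event type names are illustrative stand-ins for the structs package
// constants referenced in the diff.
const (
	TaskRestartSignal    = "Restart Signaled"
	TaskRestartAllSignal = "Restart All Signaled"
)

// restartDeadTask mirrors the guard above: a dead task may only be
// restarted when the request targets all tasks of the alloc AND its
// Run() loop is still active (runComplete == false).
func restartDeadTask(eventType string, runComplete bool) error {
	if eventType != TaskRestartAllSignal || runComplete {
		return ErrTaskNotRunning
	}
	return nil
}

func main() {
	fmt.Println(restartDeadTask(TaskRestartSignal, false))    // Task not running
	fmt.Println(restartDeadTask(TaskRestartAllSignal, false)) // <nil>
	fmt.Println(restartDeadTask(TaskRestartAllSignal, true))  // Task not running
}
```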
client/allocrunner/taskrunner/task_runner.go (resolved)
lgfa29 added a commit that referenced this pull request Aug 24, 2022
Using the task event to differentiate between the allocrunner restart
methods proved to be confusing for developers to understand how it all
worked.

So instead of relying on the event type, this commit separates the logic
of restarting a taskRunner into two methods:
- `Restart` retains the current behaviour and will only restart
  the task if it's currently running.
- `ForceRestart` is the new method where a `dead` task is allowed to
  restart if its `Run()` method is still active. Callers will need to
  restart the allocRunner taskCoordinator to make sure it will allow the
  task to run again.
@lgfa29 lgfa29 merged commit f74f508 into main Aug 24, 2022
@lgfa29 lgfa29 deleted the task-lifecycle-restart branch August 24, 2022 21:43
lgfa29 added a commit that referenced this pull request Aug 24, 2022
* allocrunner: handle lifecycle when all tasks die

When all tasks die the Coordinator must transition to its terminal
state, coordinatorStatePoststop, to unblock poststop tasks. Since this
could happen at any time (for example, a prestart task dies), all states
must be able to transition to this terminal state.

* allocrunner: implement different alloc restarts

Add a new alloc restart mode where all tasks are restarted, even if they
have already exited. Also unifies the alloc restart logic to use the
implementation that restarts tasks concurrently and ignores
ErrTaskNotRunning errors since those are expected when restarting the
allocation.

* allocrunner: allow tasks to run again

Prevent the task runner Run() method from exiting to allow a dead task
to run again. When the task runner is signaled to restart, the function
will jump back to the MAIN loop and run it again.

The task runner determines if a task needs to run again based on two new
task events that were added to differentiate between a request to
restart a specific task, the tasks that are currently running, and all
tasks that have already run.
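The "jump back to the MAIN loop" behaviour described above can be modeled with a labeled loop; this is a simplified sketch of the control flow, not the real task runner code.

```go
package main

import "fmt"

// run models the task runner's Run() method: a restart request sends
// control back to the MAIN loop instead of letting the function
// return. This is a simplified sketch, not the real runner.
func run(restarts int) []string {
	var events []string
MAIN:
	for {
		events = append(events, "task ran")
		if restarts > 0 {
			restarts--
			continue MAIN // restart requested: run the task again
		}
		break
	}
	// once we fall out of MAIN cleanly, the task can never run again
	events = append(events, "run loop exited")
	return events
}

func main() {
	fmt.Println(run(2)) // [task ran task ran task ran run loop exited]
}
```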

* api/cli: add support for all tasks alloc restart

Implement the new -all-tasks alloc restart CLI flag and its API
counterpart, AllTasks. The client endpoint calls the appropriate restart
method from the allocrunner depending on the restart parameters used.
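Assuming the flag behaves as described in this PR (the allocation ID and task name below are made up for illustration), usage would look like:

```shell
# Restart only the currently running tasks of an allocation:
nomad alloc restart 8a2f9e31

# Restart all tasks, including non-sidecar prestart and poststop
# tasks that have already finished running:
nomad alloc restart -all-tasks 8a2f9e31

# Restarting a single dead task with -task still returns
# "Task not running":
nomad alloc restart -task my-prestart-task 8a2f9e31
```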

* test: fix tasklifecycle Coordinator test

* allocrunner: kill taskrunners if all tasks are dead

When all non-poststop tasks are dead we need to kill the taskrunners so
we don't leak their goroutines, which are blocked in the alloc restart
loop. This also ensures the allocrunner exits on its own.

* taskrunner: fix tests that waited on WaitCh

Now that "dead" tasks may run again, the taskrunner Run() method will
not return when the task finishes running, so tests must wait for the
task state to be "dead" instead of using the WaitCh, since it won't be
closed until the taskrunner is killed.

* tests: add tests for all tasks alloc restart

* changelog: add entry for #14127

* taskrunner: fix restore logic.

The first implementation of the task runner restore process relied on
server data (`tr.Alloc().TerminalStatus()`) which may not be available
to the client at the time of restore.

It also took the incorrect code path. When restoring a dead task, the
driver handle always needs to be cleared cleanly using `clearDriverHandle`,
otherwise, after exiting the MAIN loop, the task may be killed by
`tr.handleKill`.

The fix is to store the state of the Run() loop in the task runner's local
client state: if the task runner ever exits this loop cleanly (not due to
a shutdown), it will never be able to run again. So if the Run() loop
starts with this local state flag set, it must exit early.

This local state flag is also checked on task restart requests. If
the task is "dead" and its Run() loop is not active, it will never be
able to run again.

* address code review requests

* apply more code review changes

* taskrunner: add different Restart modes

Using the task event to differentiate between the allocrunner restart
methods proved to be confusing for developers to understand how it all
worked.

So instead of relying on the event type, this commit separates the logic
of restarting a taskRunner into two methods:
- `Restart` retains the current behaviour and will only restart
  the task if it's currently running.
- `ForceRestart` is the new method where a `dead` task is allowed to
  restart if its `Run()` method is still active. Callers will need to
  restart the allocRunner taskCoordinator to make sure it will allow the
  task to run again.

* minor fixes
lgfa29 added a commit that referenced this pull request Aug 24, 2022
(same commit message as the merge commit above)

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
@github-actions

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 23, 2022
@angrycub angrycub restored the task-lifecycle-restart branch January 5, 2023 22:26
@angrycub angrycub deleted the task-lifecycle-restart branch January 5, 2023 23:41