allocrunner: refactor task coordinator #14009

lgfa29 · 2022-08-04T14:32:37Z

The current implementation for the task coordinator unblocks tasks by
performing destructive operations over its internal state (like closing
channels and deleting keys from maps).

This presents a problem in situations where we would like to revert the
state of a task, such as when restarting an allocation with tasks that
have already exited.

With this new implementation the task coordinator behaves more like a
finite state machine where tasks may be blocked/unblocked multiple times
by performing a state transition.

This initial part of the work only refactors the task coordinator and
is functionally equivalent to the previous implementation. Future work
will build upon this to provide bug fixes and enhancements.

Note to reviewers: the implementation ended up a little different from the internal RFC description. During tests I noticed that there were more state transitions than I first thought about and the mechanism of closing channels became brittle.

It was very easy to accidentally close a channel twice, which meant that we had to be very careful during state transitions. This resulted in an implicit dependency between a state and the path taken to reach it (e.g., if coming from the X state don't close channel Y).
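To make the double-close hazard concrete, here is a minimal standalone sketch (not code from this PR; `closeTwice` is a hypothetical helper). Closing an already closed channel panics in Go, which is why each transition had to know whether some earlier transition had closed the channel already.

```go
package main

import "fmt"

// closeTwice is a hypothetical helper for illustration only. It closes the
// same channel twice and reports whether the second close panicked.
func closeTwice() (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()

	ch := make(chan struct{})
	close(ch)
	close(ch) // panics with "close of closed channel"
	return false
}

func main() {
	fmt.Println("second close panicked:", closeTwice())
}
```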

So I opted for an earlier idea which is to have channels with a producer that can be disabled on demand. This makes the operation idempotent, so we don't have to worry about previous states during state transition, but it adds a bit more code and could have some other problems I may have missed.
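A rough sketch of that idea, under my own assumptions: the `gate` type, its fields, and the `isOpen` helper below are hypothetical and are not the code in this PR. A goroutine acts as the producer for the wait channel, and it is disabled by setting its send channel to nil, since a send on a nil channel never proceeds inside a `select`. Opening or closing the gate only flips a flag, so repeating either call is harmless.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// gate is a hypothetical sketch. Waiters receive from waitCh, and a
// producer goroutine only sends on it while the gate is open.
type gate struct {
	waitCh   chan struct{}
	updateCh chan bool
	stopCh   chan struct{}
	stopOnce sync.Once
}

func newGate(open bool) *gate {
	g := &gate{
		waitCh:   make(chan struct{}),
		updateCh: make(chan bool),
		stopCh:   make(chan struct{}),
	}
	go g.run(open)
	return g
}

// run is the producer loop. While the gate is closed sendCh is nil, so the
// send case can never be selected and waiters stay blocked.
func (g *gate) run(open bool) {
	for {
		var sendCh chan struct{}
		if open {
			sendCh = g.waitCh
		}
		select {
		case sendCh <- struct{}{}:
		case open = <-g.updateCh:
		case <-g.stopCh:
			return
		}
	}
}

// Open and Close are idempotent: they only set the producer's flag.
func (g *gate) Open()  { g.set(true) }
func (g *gate) Close() { g.set(false) }

func (g *gate) set(open bool) {
	select {
	case g.updateCh <- open:
	case <-g.stopCh:
	}
}

// Stop terminates the producer goroutine and is safe to call repeatedly.
func (g *gate) Stop() { g.stopOnce.Do(func() { close(g.stopCh) }) }

func (g *gate) WaitCh() <-chan struct{} { return g.waitCh }

// isOpen reports whether a waiter gets through within a short timeout.
func isOpen(g *gate) bool {
	select {
	case <-g.WaitCh():
		return true
	case <-time.After(50 * time.Millisecond):
		return false
	}
}

func main() {
	g := newGate(false)
	defer g.Stop()

	g.Open()
	g.Open() // repeating the call is harmless
	fmt.Println("open:", isOpen(g))

	g.Close()
	fmt.Println("open:", isOpen(g))

	g.Open() // the gate can be reopened, unlike a closed channel
	fmt.Println("open:", isOpen(g))
}
```

The trade-off is the one mentioned above: one extra goroutine and a few more channels per gate, in exchange for transitions that do not depend on the previous state.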

lgfa29 · 2022-08-04T14:35:10Z

client/allocrunner/task_coordinator_test.go

+	default:
+		return true
+	}
+}


Since this is just a refactoring of existing code, I tried to keep the behaviour the same, including the tests, but the file rename created a new diff. Here's the old vs. new test diff: https://gist.github.com/lgfa29/6c7d7b6590a9ea377ad12e886c3bfd4c

client/allocrunner/alloc_runner_test.go

client/allocrunner/task_coordinator.go

shoenig · 2022-08-04T18:45:18Z

@lgfa29 any chance you've run this under the Go race detector?

lgfa29 · 2022-08-04T19:31:15Z

> @lgfa29 any chance you've run this under the Go race detector?

I have not, but that's a great idea. I have a janky random task state generator that I used to find some subtle bugs, I will run it with the race detector activated.

tgross

@lgfa29 this is a great PR and will make a huge improvement in comprehensibility of this tricky area of the client. I've left some comments for discussion.

client/allocrunner/task_coordinator.go

client/allocrunner/task_coordinator_controller.go

tgross · 2022-08-05T13:57:49Z

client/allocrunner/alloc_runner.go

+	// Start and wait for all tasks.
 	for _, task := range ar.tasks {
 		go task.Run()
 	}
-
-	// Block on all tasks except poststop tasks
 	for _, task := range ar.tasks {
-		if !task.IsPoststopTask() {
-			<-task.WaitCh()
-		}
-	}
-
-	// Signal poststop tasks to proceed to main runtime
-	ar.taskHookCoordinator.StartPoststopTasks()
-
-	// Wait for poststop tasks to finish before proceeding
-	for _, task := range ar.tasks {
-		if task.IsPoststopTask() {
-			<-task.WaitCh()
-		}
+		<-task.WaitCh()


This section immediately shows the value of this approach 👍

tgross · 2022-08-05T14:15:55Z

client/allocrunner/task_coordinator.go

+
+// taskStateUpdated notifies that a task state has changed. This may cause the
+// taskCoordinator FSM to progress to another state.
+func (c *taskCoordinator) taskStateUpdated(states map[string]*structs.TaskState) {


A note to any other reviewers: these *structs.TaskState pointers are to copies of the task state created in allocrunner
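For anyone unfamiliar with why that matters, here is a minimal sketch of the pattern, with assumptions: `TaskState` below is a hypothetical one-field stand-in for `structs.TaskState`, and `copyStates` is an illustrative helper, not the allocrunner code. Because the map holds pointers to copies, whatever receives the map can read or even mutate the values without touching the originals.

```go
package main

import "fmt"

// TaskState is a hypothetical stand-in for structs.TaskState.
type TaskState struct {
	State string
}

// copyStates returns a new map whose values point at copies of the
// original task states. This is a shallow struct copy, so it is only a
// full copy for structs without nested pointers, like this stand-in.
func copyStates(in map[string]*TaskState) map[string]*TaskState {
	out := make(map[string]*TaskState, len(in))
	for name, ts := range in {
		if ts == nil {
			out[name] = nil
			continue
		}
		c := *ts
		out[name] = &c
	}
	return out
}

func main() {
	orig := map[string]*TaskState{"web": {State: "running"}}
	cp := copyStates(orig)

	// Mutating the copy leaves the original untouched.
	cp["web"].State = "dead"
	fmt.Println(orig["web"].State, cp["web"].State)
}
```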

client/allocrunner/task_coordinator_test.go

The initial implementation wasn't very clear about how all the pieces were connected together. This was the result of bad naming and encapsulation.

This commit fixes this by moving the new structs into their own package, since they are isolated from the rest of the allocrunner. Moving them into their own package allows for simpler names that don't have to repeat the task* prefix all the time, and for better interfaces, where methods that are expected to be called by external components are now public and internal methods remain private.

The package also has a doc.go file with more extensive documentation.

The taskCoordinatorController struct was also poorly named (controller is too generic), so it was renamed to simply Gate, as it better reflects its controls and behaviour.

A poststart task can have sidecar set to true or false. This difference is usually not relevant when coordinating task start order: even with a restart command, tasks only run once. But when a taskRunner is recovered after a Nomad agent restarts, the Coordinator must block poststart non-sidecar tasks from running again.

lgfa29 · 2022-08-11T00:34:05Z

@shoenig @tgross @schmichael I'm done fiddling with this PR now 😄

shoenig

LGTM! The improvement in readability is 💯

One last thought - would it be worth holding off on backporting until it has time to bake in the 1.4 beta? We could still backport it to 1.3.x on the 1.4 release; this just seems a bit much to go into a bugfix backport directly.

client/allocrunner/tasklifecycle/coordinator.go

client/allocrunner/tasklifecycle/doc.go

tgross

Nice work here @lgfa29!

client/allocrunner/tasklifecycle/gate.go

schmichael

testing package comment is the only blocker I think. Great work!

client/allocrunner/tasklifecycle/coordinator.go

client/allocrunner/tasklifecycle/gate.go

client/allocrunner/tasklifecycle/testing.go

github-actions · 2022-12-21T02:13:59Z

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

lgfa29 requested review from shoenig, schmichael and tgross August 4, 2022 14:36

lgfa29 added the backport/1.1.x, backport/1.2.x, and backport/1.3.x labels and removed the backport/1.1.x and backport/1.2.x labels Aug 4, 2022

shoenig reviewed Aug 4, 2022

tgross reviewed Aug 5, 2022

lgfa29 added 3 commits August 8, 2022 17:39

refactor TestHasSidecarTasks

c9706c9

fix declaration of constants with custom type

c22b0f2

minor fixes

eabc168

lgfa29 added 2 commits August 10, 2022 12:31

make godoc diagram dashed line more distinguishible

15be12d

fix tests

b4b4607

lgfa29 added 2 commits August 10, 2022 19:38

add tests for gate with closed shutdownCh

124d296

expand comment

f38dd03

shoenig approved these changes Aug 11, 2022

tgross approved these changes Aug 15, 2022

lgfa29 mentioned this pull request Aug 18, 2022

Task lifecycle restart #14127

Merged

schmichael approved these changes Aug 18, 2022

address code review

5b60175

improve godoc for Gate.WaitCh()

8b57e94

lgfa29 merged commit 6070fa0 into main Aug 22, 2022

lgfa29 deleted the task-lifecycle-fsm branch August 22, 2022 22:38

hc-github-team-nomad-core mentioned this pull request Aug 22, 2022

Backport of allocrunner: refactor task coordinator into release/1.3.x #14225

Merged

lgfa29 added this to the 1.3.4 milestone Aug 22, 2022

hc-github-team-nomad-core mentioned this pull request Aug 24, 2022

Backport of Task lifecycle restart into release/1.3.x #14312

Merged

shoenig mentioned this pull request Sep 26, 2022

Service registry not being updated for alloc restarts #13802

Closed

github-actions bot locked as resolved and limited conversation to collaborators Dec 21, 2022
