Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

allocrunner: refactor task coordinator #14009

Merged
merged 12 commits into from
Aug 22, 2022
Merged

allocrunner: refactor task coordinator #14009

merged 12 commits into from
Aug 22, 2022

Conversation

lgfa29
Copy link
Contributor

@lgfa29 lgfa29 commented Aug 4, 2022

The current implementation for the task coordinator unblocks tasks by
performing destructive operations over its internal state (like closing
channels and deleting maps from keys).

This presents a problem in situations where we would like to revert the
state of a task, such as when restarting an allocation with tasks that
have already exited.

With this new implementation the task coordinator behaves more like a
finite state machine where task may be blocked/unblocked multiple times
by performing a state transition.

This initial part of the work only refactors the task coordinator and
is functionally equivalent to the previous implementation. Future work
will build upon this to provide bug fixes and enhancements.


Note to reviewers: the implementation ended up a little different from the internal RFC description. During tests I noticed that there were more state transitions than I first thought about and the mechanism of closing channels became brittle.

It was very easy to accidentally close a channel twice, which meant that we had to be very careful during state transitions. This resulted in an implicit dependency between a state and the path taken to reach it (e.g., if coming from the X state don't close channel Y).

So I opted for an earlier idea which is to have channels with a producer that can be disabled on demand. This makes the operation idempotent, so we don't have to worry about previous states during state transition, but it adds a bit more code and could have some other problems I may have missed.

The current implementation for the task coordinator unblocks tasks by
performing destructive operations over its internal state (like closing
channels and deleting maps from keys).

This presents a problem in situations where we would like to revert the
state of a task, such as when restarting an allocation with tasks that
have already exited.

With this new implementation the task coordinator behaves more like a
finite state machine where task may be blocked/unblocked multiple times
by performing a state transition.

This initial part of the work only refactors the task coordinator and
is functionally equivalent to the previous implementation. Future work
will build upon this to provide bug fixes and enhancements.
default:
return true
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is just a refactoring of existing code, I tried to keep the behaviour the same, including the tests, but the file rename created a new diff. Here's the old vs. new test diff: https://gist.github.com/lgfa29/6c7d7b6590a9ea377ad12e886c3bfd4c

@lgfa29 lgfa29 added backport/1.1.x backport to 1.1.x release line backport/1.2.x backport to 1.1.x release line backport/1.3.x backport to 1.3.x release line and removed backport/1.1.x backport to 1.1.x release line backport/1.2.x backport to 1.1.x release line labels Aug 4, 2022
client/allocrunner/alloc_runner_test.go Outdated Show resolved Hide resolved
client/allocrunner/task_coordinator.go Outdated Show resolved Hide resolved
client/allocrunner/task_coordinator.go Outdated Show resolved Hide resolved
@shoenig
Copy link
Member

shoenig commented Aug 4, 2022

@lgfa29 any chance you've run this under the Go race detector?

@lgfa29
Copy link
Contributor Author

lgfa29 commented Aug 4, 2022

@lgfa29 any chance you've run this under the Go race detector?

I have not, but that's a great idea. I have a janky random task state generator that I used to find some subtle bugs, I will run it with the race detector activated.

Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lgfa29 this is a great PR and will make a huge improvement in comprehensibility of this tricky area of the client. I've left some comments for discussion.

client/allocrunner/task_coordinator.go Outdated Show resolved Hide resolved
client/allocrunner/task_coordinator_controller.go Outdated Show resolved Hide resolved
client/allocrunner/task_coordinator_controller.go Outdated Show resolved Hide resolved
client/allocrunner/task_coordinator_controller.go Outdated Show resolved Hide resolved
Comment on lines +393 to +398
// Start and wait for all tasks.
for _, task := range ar.tasks {
go task.Run()
}

// Block on all tasks except poststop tasks
for _, task := range ar.tasks {
if !task.IsPoststopTask() {
<-task.WaitCh()
}
}

// Signal poststop tasks to proceed to main runtime
ar.taskHookCoordinator.StartPoststopTasks()

// Wait for poststop tasks to finish before proceeding
for _, task := range ar.tasks {
if task.IsPoststopTask() {
<-task.WaitCh()
}
<-task.WaitCh()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section immediately shows the value of this approach 👍


// taskStateUpdated notifies that a task state has changed. This may cause the
// taskCoordinator FSM to progresses to another state.
func (c *taskCoordinator) taskStateUpdated(states map[string]*structs.TaskState) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A note to any other reviewers: these *structs.TaskState pointers are to copies of the task state created in allocrunner

client/allocrunner/task_coordinator_test.go Outdated Show resolved Hide resolved
client/allocrunner/task_coordinator_test.go Outdated Show resolved Hide resolved
client/allocrunner/task_coordinator_test.go Outdated Show resolved Hide resolved
The initial implementation wasn't very clear as to how all pieces were
connected together. This was the result of bad naming and encapsulation.
This commits fixes this by moving the new structs into their own package
since they are isolated from the rest of the allocrunner.

Moving them into their own package allows for simpler names that don't
have to repeat the task* prefix all the time and for better interfaces,
where methods that are expected to be called by external components are
now public and internal methods remain private. The package also has a
doc.go file with more extensive documentation.

The taskCoordinatorController struct was also poorly named (controller
is too generic) so it was renamed to simply Gate as it better reflects
its controls and behaviour.
A poststart task can have sidecar set to true or false. This difference
is usually not relevant when coordinating task start order: even with a
restart command, tasks only run once.

But when a taskRunner is recovered after a Nomad agent restarts, the
Coordinator must block postart non-sidecar tasks from running again.
@lgfa29
Copy link
Contributor Author

lgfa29 commented Aug 11, 2022

@shoenig @tgross @schmichael I'm done fiddling with this PR now 😄

Copy link
Member

@shoenig shoenig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! The improvement in readability is 💯

One last thought - would it be worth holding off on backporting until it has time to bake in the 1.4 beta? We could still backport it to 1.3.x on the 1.4 release; this just seems a bit much to go into a bugfix backport directly.

client/allocrunner/tasklifecycle/coordinator.go Outdated Show resolved Hide resolved
client/allocrunner/tasklifecycle/doc.go Show resolved Hide resolved
Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work here @lgfa29!

client/allocrunner/tasklifecycle/gate.go Show resolved Hide resolved
@lgfa29 lgfa29 mentioned this pull request Aug 18, 2022
Copy link
Member

@schmichael schmichael left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

testing package comment is the only blocker I think. Great work!

client/allocrunner/tasklifecycle/coordinator.go Outdated Show resolved Hide resolved
client/allocrunner/tasklifecycle/coordinator.go Outdated Show resolved Hide resolved
client/allocrunner/tasklifecycle/gate.go Show resolved Hide resolved
client/allocrunner/tasklifecycle/gate.go Outdated Show resolved Hide resolved
client/allocrunner/tasklifecycle/testing.go Outdated Show resolved Hide resolved
@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 21, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
backport/1.3.x backport to 1.3.x release line
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants