
lifecycle: add poststop hook #8194

Merged 2 commits into master from the lifecycle-poststop-hook branch on Nov 12, 2020
Conversation

@jazzyfresh (Contributor) commented Jun 17, 2020

For #8193

Overview

A poststop hook has been added to the lifecycle stanza to allow users to run tasks after all other tasks in an allocation have completed.

Behavior

This poststop task will run after the main tasks have finished running. It runs regardless of the main tasks' final status: completed, failed, or killed. (See the sketch after the list below.)

  • batch jobs: poststop tasks wait until the main tasks have finished before starting
  • service & system jobs: since these jobs are long-lived, poststop tasks run only after the allocation receives a kill signal or a main task fails
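For illustration, here is a minimal Go sketch of tagging a task with the new poststop hook, mirroring the style of this PR's tests. The constant structs.TaskLifecycleHookPoststop and the Lifecycle.Hook field appear in the diff; the task name, driver, and the TaskLifecycleConfig type name are assumptions for the sake of a self-contained example.

package main

import (
	"fmt"

	"github.com/hashicorp/nomad/nomad/structs"
)

func main() {
	// A task that should run only after all other tasks in the
	// allocation have finished. Name and driver are placeholders.
	cleanup := &structs.Task{
		Name:   "cleanup",
		Driver: "raw_exec",
		Lifecycle: &structs.TaskLifecycleConfig{
			Hook: structs.TaskLifecycleHookPoststop,
		},
	}
	fmt.Println(cleanup.Name, "runs at:", cleanup.Lifecycle.Hook)
}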

Future Development

I plan to make these improvements in separate pull requests:

  • increased e2e testing
  • refactor to move behavior out of the allocrunner Run() & back into the TaskHookCoordinator
  • expose prior task state to the poststop task (this requires a ticket for usability research & user feedback; we could set an environment variable and also put the full task state into the alloc filesystem, but we need to figure out what a user would expect)

@notnoop (Contributor) left a comment:

Having tests would be great. I'm also unclear on the semantics of post-stop and which use cases we want to target: when should these tasks run?

  • Should they run if the main tasks fail? Cleanup tasks should probably run all the time.
  • Should they run if nomad job stop is invoked? I think the current implementation would probably not run them.
  • Should sidecars run concurrently with post-stop tasks? Having sidecars run until the very end makes sense.

Also, we probably need to make post-stop semantics clear to users in our documentation. For example, if a main task succeeds but a post-stop task fails (or the host dies during a post-stop task), the allocation might be rescheduled and rerun.

client/allocrunner/task_hook_coordinator.go
@@ -63,15 +77,29 @@ func (c *taskHookCoordinator) setTasks(tasks []*structs.Task) {
	if !c.hasPrestartTasks() {
		c.mainTaskCtxCancel()
	}
	if !c.hasMainTasks() {
Contributor:

Do we want to allow jobs to not have main tasks? That seems like an invalid TaskGroup?

Also, we can close the channel if there aren't any post-start tasks; this will make taskStateUpdated basically a no-op in the common case (see the sketch below).
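For reference, a minimal standalone sketch of the closed-channel pattern being suggested (names are illustrative, not from the PR): once a gate channel is closed, every receive on it returns immediately, so waiting on it becomes a no-op.

package main

import "fmt"

func main() {
	gate := make(chan struct{})

	// Nothing to coordinate: close the gate up front.
	close(gate)

	// Receiving from a closed channel returns immediately instead of
	// blocking, so this wait is effectively a no-op.
	<-gate
	fmt.Println("gate open; proceeding without blocking")
}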

client/allocrunner/task_hook_coordinator.go
Comment on lines 143 to 161
for task := range c.poststopTasks {
	st := states[task]
	if st == nil || !st.Successful() {
		continue
	}

	delete(c.poststopTasks, task)
}
Contributor:
Do we need to track post-stop tasks? We don't depend on their state currently, as nothing is blocked by them.

@tgross force-pushed the lifecycle-poststop-hook branch 4 times, most recently from 9fbbf9b to 8a98124 on June 25, 2020
@tgross (Member) commented Jun 25, 2020

E2E tests:

=== RUN   TestE2E/Lifecycle/*lifecycle.LifecycleE2ETest/TestBatchJob
=== RUN   TestE2E/Lifecycle/*lifecycle.LifecycleE2ETest/TestServiceJob
    TestE2E/Lifecycle/*lifecycle.LifecycleE2ETest/TestServiceJob: lifecycle.go:92:
                Error Trace:    lifecycle.go:92
                Error:          Not equal:
                                expected: map[string]bool{"cleanup-ran":true, "cleanup-running":false, "init-ran":true, "init-running":false, "main-ran":true, "main-running":false, "sidecar-ran":true, "sidecar-running":false}
                                actual  : map[string]bool{"cleanup-ran":false, "cleanup-running":false, "init-ran":false, "init-running":false, "main-ran":false, "main-running":false, "sidecar-ran":true, "sidecar-running":true}

                                Diff:
                                --- Expected
                                +++ Actual
                                @@ -1,10 +1,10 @@
                                 (map[string]bool) (len=8) {
                                - (string) (len=11) "cleanup-ran": (bool) true,
                                + (string) (len=11) "cleanup-ran": (bool) false,
                                  (string) (len=15) "cleanup-running": (bool) false,
                                - (string) (len=8) "init-ran": (bool) true,
                                + (string) (len=8) "init-ran": (bool) false,
                                  (string) (len=12) "init-running": (bool) false,
                                - (string) (len=8) "main-ran": (bool) true,
                                + (string) (len=8) "main-ran": (bool) false,
                                  (string) (len=12) "main-running": (bool) false,
                                  (string) (len=11) "sidecar-ran": (bool) true,
                                - (string) (len=15) "sidecar-running": (bool) false
                                + (string) (len=15) "sidecar-running": (bool) true
                                 }
                Test:           TestE2E/Lifecycle/*lifecycle.LifecycleE2ETest/TestServiceJob

@tgross force-pushed the lifecycle-poststop-hook branch 2 times, most recently from 1feda39 to 5e6dcd8 on June 25, 2020
@jazzyfresh mentioned this pull request Jun 30, 2020
@cgbaker removed their request for review Jul 15, 2020
@jazzyfresh removed the request for review from jrasell Jul 15, 2020
@jazzyfresh added this to the 0.12.4 milestone Aug 31, 2020
@jazzyfresh marked this pull request as ready for review Aug 31, 2020
@jazzyfresh removed this from the 0.12.4 milestone Aug 31, 2020
@schmichael linked an issue Aug 31, 2020 that may be closed by this pull request
@jazzyfresh added this to the 0.12.4 milestone Sep 1, 2020
@tgross modified the milestones: 0.12.4, 0.13 Sep 9, 2020
@vercel (bot) commented Oct 8, 2020

This pull request is being automatically deployed with Vercel.
To see the status of your deployment, click below or on the icon next to each commit.

🔍 Inspect: https://vercel.com/hashicorp/nomad/n6d295jf2
✅ Preview: https://nomad-git-lifecycle-poststop-hook.hashicorp.vercel.app

[Deployment for 758ca14 canceled]

for _, task := range ar.tasks {
	if task.IsPoststopTask() {
		<-task.WaitCh()
	}
}
@jazzyfresh (Contributor, Author):

This logic is responsible for starting poststop tasks after the other tasks have finished.

This covers the nomad job stop case (simulated in the sketch after this list):

  • wait until existing tasks have finished running
  • start poststop tasks
  • wait until poststop tasks finish
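As an illustration, here is a standalone Go sketch that simulates this three-step sequence with stand-in goroutines and a gate channel; all names are illustrative and this is not Nomad's code.

package main

import (
	"fmt"
	"sync"
)

func main() {
	poststopGate := make(chan struct{})
	var mainWG, postWG sync.WaitGroup

	// Stand-ins for the allocation's existing (main) tasks.
	for i := 0; i < 2; i++ {
		mainWG.Add(1)
		go func(id int) {
			defer mainWG.Done()
			fmt.Printf("main task %d finished\n", id)
		}(i)
	}

	// Stand-in for a poststop task: it blocks until the gate opens.
	postWG.Add(1)
	go func() {
		defer postWG.Done()
		<-poststopGate
		fmt.Println("poststop task ran")
	}()

	mainWG.Wait()       // 1. wait until existing tasks have finished running
	close(poststopGate) // 2. start poststop tasks
	postWG.Wait()       // 3. wait until poststop tasks finish
}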

Contributor:

How does this compare to having the coordinator track main task completion? The coordinator logic already tracks task completions (so main tasks start after pre-start); can we reuse the same state machine rather than add the logic here?

@jazzyfresh (Contributor, Author) commented Nov 11, 2020:

This was discussed offline; the goal is to do a refactor in another pull request to move the poststop-specific code out of the alloc runner and back into the task hook coordinator. It was avoided this time around because I didn't know how to block the allocrunner from exiting during a service job stop while still signaling the poststop tasks to start.

if tr.IsPoststopTask() {
	continue
}

@jazzyfresh (Contributor, Author):

Poststop tasks have special kill behavior: we don't want to kill them when we receive a kill signal from nomad job stop. We want to wait until everything else has been killed, and then run the poststop tasks.

@jazzyfresh (Contributor, Author) commented Oct 8, 2020:

Skipping the rest of this loop essentially says: don't track poststop tasks along with the main group of tasks (in the set of liveRunners). That way we won't later log them as killed when there is a kill event. (A toy model of this filtering follows.)
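To make the filtering concrete, here is a standalone toy model; the task type and its fields are stand-ins, not Nomad's, and only IsPoststopTask mirrors a name from the diff.

package main

import "fmt"

type task struct {
	name     string
	poststop bool
}

func (t task) IsPoststopTask() bool { return t.poststop }

func main() {
	tasks := []task{
		{name: "main", poststop: false},
		{name: "sidecar", poststop: false},
		{name: "cleanup", poststop: true},
	}

	// Build the set of "live" runners that a kill event acts on; poststop
	// tasks are skipped so they are never logged as killed later.
	var liveRunners []task
	for _, tr := range tasks {
		if tr.IsPoststopTask() {
			continue
		}
		liveRunners = append(liveRunners, tr)
	}

	for _, tr := range liveRunners {
		fmt.Println(tr.name) // prints: main, sidecar
	}
}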

@@ -574,7 +593,8 @@ func (ar *allocRunner) killTasks() map[string]*structs.TaskState {
 	// Kill the rest concurrently
 	wg := sync.WaitGroup{}
 	for name, tr := range ar.tasks {
-		if tr.IsLeader() {
+		// Filter out poststop tasks so they run after all the other tasks are killed
+		if tr.IsLeader() || tr.IsPoststopTask() {
@jazzyfresh (Contributor, Author):

This is the line that actually prevents poststop tasks from being killed in a kill event.

@rcoder commented Oct 8, 2020:

The checks for IsPoststopTask() spread throughout the code, especially when they're part of a multi-clause boolean check, strike me as a bit low-level and specific to this task type, vs. capturing a general property of the tasks.

Would it be reasonable to explicitly define interface methods for e.g. tr.isReadyToStart() and tr.isReadyToKill() (or similar) that hide the property checks behind the task runner interface? (OTOH, this could easily be future-proofing something we don't care about yet, or run contrary to our standard practices around expanding interface footprint.) A sketch of the idea follows.
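For illustration, a self-contained sketch of this suggestion with stand-in types; the method name isReadyToKill comes from the comment, and everything else here is hypothetical rather than Nomad's actual TaskRunner.

package main

import "fmt"

type TaskRunner struct {
	leader   bool
	poststop bool
}

func (tr *TaskRunner) IsLeader() bool       { return tr.leader }
func (tr *TaskRunner) IsPoststopTask() bool { return tr.poststop }

// isReadyToKill hides the multi-clause property check behind an
// intent-revealing method: leaders are killed last, and poststop tasks
// must outlive the kill event entirely.
func (tr *TaskRunner) isReadyToKill() bool {
	return !tr.IsLeader() && !tr.IsPoststopTask()
}

func main() {
	cleanup := &TaskRunner{poststop: true}
	fmt.Println(cleanup.isReadyToKill()) // false: not killed with the rest
}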


	delete(c.mainTasksRunning, task)
}

@jazzyfresh (Contributor, Author):

If a main task is dead, remove it from the set. (When all tasks are removed from the set, poststop tasks may proceed with execution.)

}

func (c *taskHookCoordinator) StartPoststopTasks() {
	c.poststopTaskCtxCancel()
}
@jazzyfresh (Contributor, Author):

Needs a comment: this is a helper function for starting poststop tasks outside of the handleTaskStateUpdate() infinite loop. (A possible docstring is sketched below.)
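For example, a docstring along these lines could be added; the wording is a suggestion, not the author's:

// StartPoststopTasks unblocks all poststop task runners by cancelling the
// context they wait on. It exists so the alloc runner can trigger poststop
// startup from outside the handleTaskStateUpdates loop.
func (c *taskHookCoordinator) StartPoststopTasks() {
	c.poststopTaskCtxCancel()
}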

@rcoder left a comment:

I don't feel qualified to verify the correctness of this approach, though the broad shape seems fine. My comments are mostly around naming + interface conventions, and a bit about testing and clocks.

It does seem like a test at least documenting the current behavior when a poststop task gets explicitly killed would be useful. Similarly, the current e2e tests only appear to check poststop task behavior for batch jobs. Is it our intent to only support hooking into batch jobs, or could we add some additional tests for what happens with other job types? (NMD-017 doesn't appear to present a spec or even opinion on this, so I'm open to whatever answer makes sense right now, as long as we have some documentation and tests to explain it.)

@@ -523,6 +541,7 @@ func (ar *allocRunner) handleTaskStateUpdates() {
 	// prevent looping before TaskRunners have transitioned
 	// to Dead.
 	for _, tr := range liveRunners {
+		ar.logger.Info("killing task: ", tr.Task().Name)

Should this actually happen at the info level? My expectation is that these messages will hit the console in a default config, and adding a new sort of tracing/debug message here (without putting it at the debug log level) could create log noise that most operators won't know what to do with. (A possible adjustment is sketched below.)
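One possible adjustment along these lines, assuming ar.logger is Nomad's usual go-hclog logger (whose methods take a message plus alternating key/value pairs):

ar.logger.Debug("killing task", "task", tr.Task().Name)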


ephemeralTask := alloc.Job.TaskGroups[0].Tasks[1]
ephemeralTask.Name = "quit"
ephemeralTask.Lifecycle.Hook = structs.TaskLifecycleHookPoststop
ephemeralTask.Config["run_for"] = "10s"

Are these numbers wall-clock time? (I.e., will this test take a minimum of 10 seconds to run?) It may not be a big deal in isolation, but actual blocking sleep calls scattered throughout a test suite can pile up and make tests slooooowwww.

@notnoop (Contributor) left a comment:

A couple of questions; we can follow up in sync.


// TestAllocRunner_Lifecycle_Poststop asserts that a service job with 1
// poststop lifecycle hook starts all 3 tasks, only
// the ephemeral one finishes, and the other 2 exit when the alloc is stopped.
func TestAllocRunner_Lifecycle_Poststop(t *testing.T) {
Contributor:

I would suggest changing the name of this test, since it tests stop behavior: TestAllocRunner_Lifecycle_Poststop_IfStopped?

Also, we should add a test for when main tasks complete naturally, for both batch and service jobs.
Also, we should add a test for when

@notnoop (Contributor) left a comment:

Let's merge this to be included in the beta release. We will follow up with additional tests and refactoring.

@jazzyfresh merged commit b85cce4 into master Nov 12, 2020
@jazzyfresh deleted the lifecycle-poststop-hook branch November 12, 2020
@github-actions (bot) commented:

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions (bot) locked as resolved and limited conversation to collaborators Dec 11, 2022
Merging this pull request may close the issue: Task Lifecycle PostStop Hook