
lifecycle: add poststop hook #8194

Merged 2 commits into master from the lifecycle-poststop-hook branch on Nov 12, 2020
Conversation

@jazzyfresh (Contributor) commented Jun 17, 2020

For #8193

Overview

A poststop hook has been added to the lifecycle stanza to allow users to run tasks after all other tasks in an allocation have completed.

Behavior

This poststop task will run after the main tasks have finished running. It runs regardless of the main tasks' final status: completed, failed, or killed. (See the sketch after the list below.)

  • batch jobs: poststop tasks wait until the main tasks have finished before starting
  • service & system jobs: since these jobs are long-lived, poststop tasks run only after the allocation receives a kill signal or a main task fails
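For illustration, here is a minimal Go sketch of tagging a task with the new poststop hook, mirroring the style of this PR's tests. The constant structs.TaskLifecycleHookPoststop and the Lifecycle.Hook field appear in the diff; the task name, driver, and the TaskLifecycleConfig type name are assumptions for the sake of a self-contained example.

package main

import (
	"fmt"

	"github.com/hashicorp/nomad/nomad/structs"
)

func main() {
	// A task that should run only after all other tasks in the
	// allocation have finished. Name and driver are placeholders.
	cleanup := &structs.Task{
		Name:   "cleanup",
		Driver: "raw_exec",
		Lifecycle: &structs.TaskLifecycleConfig{
			Hook: structs.TaskLifecycleHookPoststop,
		},
	}
	fmt.Println(cleanup.Name, "runs at:", cleanup.Lifecycle.Hook)
}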

Future Development

I plan to make these improvements in separate pull requests:

  • increased e2e testing
  • refactor to move behavior out of the allocrunner Run() & back into the TaskHookCoordinator
  • expose prior task state to the poststop task (this requires a ticket for usability research & user feedback; we could set an environment variable and also put the full task state into the alloc filesystem, but we need to figure out what a user would expect)

@notnoop (Contributor) left a comment:

Having tests would be great. I'm also unclear on the semantics of post-stop and which use cases we want to target: when should these tasks run?

  • Should they run if the main tasks fail? Cleanup tasks should probably run all the time.
  • Should they run if nomad job stop is invoked? I think the current implementation would probably not run them.
  • Should sidecars run concurrently with post-stop tasks? Having sidecars run until the very end makes sense.

Also, we probably need to make post-stop semantics clear to users in our documentation. For example, if a main task succeeds but a post-stop task fails (or the host dies during a post-stop task), the allocation might be rescheduled and rerun.

client/allocrunner/task_hook_coordinator.go
@@ -63,15 +77,29 @@ func (c *taskHookCoordinator) setTasks(tasks []*structs.Task) {
	if !c.hasPrestartTasks() {
		c.mainTaskCtxCancel()
	}
	if !c.hasMainTasks() {
Contributor:

Do we want to allow jobs to not have main tasks? That seems like an invalid TaskGroup?

Also, we can close the channel if there aren't any post-start tasks; this will make taskStateUpdated basically a no-op in the common case (see the sketch below).
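For reference, a minimal standalone sketch of the closed-channel pattern being suggested (names are illustrative, not from the PR): once a gate channel is closed, every receive on it returns immediately, so waiting on it becomes a no-op.

package main

import "fmt"

func main() {
	gate := make(chan struct{})

	// Nothing to coordinate: close the gate up front.
	close(gate)

	// Receiving from a closed channel returns immediately instead of
	// blocking, so this wait is effectively a no-op.
	<-gate
	fmt.Println("gate open; proceeding without blocking")
}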

client/allocrunner/task_hook_coordinator.go
Comment on lines 143 to 161
for task := range c.poststopTasks {
	st := states[task]
	if st == nil || !st.Successful() {
		continue
	}

	delete(c.poststopTasks, task)
}
Contributor:
Do we need to track post-stop tasks? We don't depend on their state currently, as nothing is blocked by them.

@tgross force-pushed the lifecycle-poststop-hook branch 4 times, most recently from 9fbbf9b to 8a98124 on June 25, 2020
@tgross (Member) commented Jun 25, 2020

E2E tests:

=== RUN   TestE2E/Lifecycle/*lifecycle.LifecycleE2ETest/TestBatchJob
=== RUN   TestE2E/Lifecycle/*lifecycle.LifecycleE2ETest/TestServiceJob
    TestE2E/Lifecycle/*lifecycle.LifecycleE2ETest/TestServiceJob: lifecycle.go:92:
                Error Trace:    lifecycle.go:92
                Error:          Not equal:
                                expected: map[string]bool{"cleanup-ran":true, "cleanup-running":false, "init-ran":true, "init-running":false, "main-ran":true, "main-running":false, "sidecar-ran":true, "sidecar-running":false}
                                actual  : map[string]bool{"cleanup-ran":false, "cleanup-running":false, "init-ran":false, "init-running":false, "main-ran":false, "main-running":false, "sidecar-ran":true, "sidecar-running":true}

                                Diff:
                                --- Expected
                                +++ Actual
                                @@ -1,10 +1,10 @@
                                 (map[string]bool) (len=8) {
                                - (string) (len=11) "cleanup-ran": (bool) true,
                                + (string) (len=11) "cleanup-ran": (bool) false,
                                  (string) (len=15) "cleanup-running": (bool) false,
                                - (string) (len=8) "init-ran": (bool) true,
                                + (string) (len=8) "init-ran": (bool) false,
                                  (string) (len=12) "init-running": (bool) false,
                                - (string) (len=8) "main-ran": (bool) true,
                                + (string) (len=8) "main-ran": (bool) false,
                                  (string) (len=12) "main-running": (bool) false,
                                  (string) (len=11) "sidecar-ran": (bool) true,
                                - (string) (len=15) "sidecar-running": (bool) false
                                + (string) (len=15) "sidecar-running": (bool) true
                                 }
                Test:           TestE2E/Lifecycle/*lifecycle.LifecycleE2ETest/TestServiceJob

@tgross force-pushed the lifecycle-poststop-hook branch 2 times, most recently from 1feda39 to 5e6dcd8 on June 25, 2020
@jazzyfresh mentioned this pull request Jun 30, 2020
@cgbaker removed their request for review Jul 15, 2020
@jazzyfresh removed the request for review from jrasell Jul 15, 2020
@jazzyfresh added this to the 0.12.4 milestone Aug 31, 2020
@jazzyfresh marked this pull request as ready for review Aug 31, 2020
@jazzyfresh removed this from the 0.12.4 milestone Aug 31, 2020
@schmichael linked an issue Aug 31, 2020 that may be closed by this pull request
@jazzyfresh added this to the 0.12.4 milestone Sep 1, 2020
@tgross modified the milestones: 0.12.4, 0.13 Sep 9, 2020
@vercel (bot) commented Oct 8, 2020

This pull request is being automatically deployed with Vercel.
To see the status of your deployment, click below or on the icon next to each commit.

🔍 Inspect: https://vercel.com/hashicorp/nomad/n6d295jf2
✅ Preview: https://nomad-git-lifecycle-poststop-hook.hashicorp.vercel.app

[Deployment for 758ca14 canceled]

for _, task := range ar.tasks {
	if task.IsPoststopTask() {
		<-task.WaitCh()
	}
}
@jazzyfresh (Contributor, Author):

This logic is responsible for starting poststop tasks after the other tasks have finished.

This covers the nomad job stop case (simulated in the sketch after this list):

  • wait until existing tasks have finished running
  • start poststop tasks
  • wait until poststop tasks finish
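As an illustration, here is a standalone Go sketch that simulates this three-step sequence with stand-in goroutines and a gate channel; all names are illustrative and this is not Nomad's code.

package main

import (
	"fmt"
	"sync"
)

func main() {
	poststopGate := make(chan struct{})
	var mainWG, postWG sync.WaitGroup

	// Stand-ins for the allocation's existing (main) tasks.
	for i := 0; i < 2; i++ {
		mainWG.Add(1)
		go func(id int) {
			defer mainWG.Done()
			fmt.Printf("main task %d finished\n", id)
		}(i)
	}

	// Stand-in for a poststop task: it blocks until the gate opens.
	postWG.Add(1)
	go func() {
		defer postWG.Done()
		<-poststopGate
		fmt.Println("poststop task ran")
	}()

	mainWG.Wait()       // 1. wait until existing tasks have finished running
	close(poststopGate) // 2. start poststop tasks
	postWG.Wait()       // 3. wait until poststop tasks finish
}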

Contributor:

How does this compare to having the coordinator track main task completion? The coordinator logic already tracks task completions (so main tasks start after pre-start); can we reuse the same state machine rather than add the logic here?

@jazzyfresh (Contributor, Author) commented Nov 11, 2020:

This was discussed offline; the goal is to do a refactor in another pull request to move the poststop-specific code out of the alloc runner and back into the task hook coordinator. It was avoided this time around because I didn't know how to block the allocrunner from exiting during a service job stop while still signaling the poststop tasks to start.

if tr.IsPoststopTask() {
	continue
}

@jazzyfresh (Contributor, Author):

Poststop tasks have special kill behavior: we don't want to kill them when we receive a kill signal from nomad job stop. We want to wait until everything else has been killed, and then run the poststop tasks.

@jazzyfresh (Contributor, Author) commented Oct 8, 2020:

Skipping the rest of this loop essentially says: don't track poststop tasks along with the main group of tasks (in the set of liveRunners). That way we won't later log them as killed when there is a kill event. (A toy model of this filtering follows.)
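To make the filtering concrete, here is a standalone toy model; the task type and its fields are stand-ins, not Nomad's, and only IsPoststopTask mirrors a name from the diff.

package main

import "fmt"

type task struct {
	name     string
	poststop bool
}

func (t task) IsPoststopTask() bool { return t.poststop }

func main() {
	tasks := []task{
		{name: "main", poststop: false},
		{name: "sidecar", poststop: false},
		{name: "cleanup", poststop: true},
	}

	// Build the set of "live" runners that a kill event acts on; poststop
	// tasks are skipped so they are never logged as killed later.
	var liveRunners []task
	for _, tr := range tasks {
		if tr.IsPoststopTask() {
			continue
		}
		liveRunners = append(liveRunners, tr)
	}

	for _, tr := range liveRunners {
		fmt.Println(tr.name) // prints: main, sidecar
	}
}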

@@ -574,7 +593,8 @@ func (ar *allocRunner) killTasks() map[string]*structs.TaskState {
 	// Kill the rest concurrently
 	wg := sync.WaitGroup{}
 	for name, tr := range ar.tasks {
-		if tr.IsLeader() {
+		// Filter out poststop tasks so they run after all the other tasks are killed
+		if tr.IsLeader() || tr.IsPoststopTask() {
@jazzyfresh (Contributor, Author):

This is the line that actually prevents poststop tasks from being killed in a kill event.

@rcoder commented Oct 8, 2020:

The checks for IsPoststopTask() spread throughout the code, especially when they're part of a multi-clause boolean check, strike me as a bit low-level and specific to this task type, vs. capturing a general property of the tasks.

Would it be reasonable to explicitly define interface methods for e.g. tr.isReadyToStart() and tr.isReadyToKill() (or similar) that hide the property checks behind the task runner interface? (OTOH, this could easily be future-proofing something we don't care about yet, or run contrary to our standard practices around expanding interface footprint.) A sketch of the idea follows.
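For illustration, a self-contained sketch of this suggestion with stand-in types; the method name isReadyToKill comes from the comment, and everything else here is hypothetical rather than Nomad's actual TaskRunner.

package main

import "fmt"

type TaskRunner struct {
	leader   bool
	poststop bool
}

func (tr *TaskRunner) IsLeader() bool       { return tr.leader }
func (tr *TaskRunner) IsPoststopTask() bool { return tr.poststop }

// isReadyToKill hides the multi-clause property check behind an
// intent-revealing method: leaders are killed last, and poststop tasks
// must outlive the kill event entirely.
func (tr *TaskRunner) isReadyToKill() bool {
	return !tr.IsLeader() && !tr.IsPoststopTask()
}

func main() {
	cleanup := &TaskRunner{poststop: true}
	fmt.Println(cleanup.isReadyToKill()) // false: not killed with the rest
}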


	delete(c.mainTasksRunning, task)
}

@jazzyfresh (Contributor, Author):

If a main task is dead, remove it from the set. (When all tasks are removed from the set, poststop tasks may proceed with execution.)

}

func (c *taskHookCoordinator) StartPoststopTasks() {
	c.poststopTaskCtxCancel()
}
@jazzyfresh (Contributor, Author):

Needs a comment: this is a helper function for starting poststop tasks outside of the handleTaskStateUpdate() infinite loop. (A possible docstring is sketched below.)
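For example, a docstring along these lines could be added; the wording is a suggestion, not the author's:

// StartPoststopTasks unblocks all poststop task runners by cancelling the
// context they wait on. It exists so the alloc runner can trigger poststop
// startup from outside the handleTaskStateUpdates loop.
func (c *taskHookCoordinator) StartPoststopTasks() {
	c.poststopTaskCtxCancel()
}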

@rcoder left a comment:

I don't feel qualified to verify the correctness of this approach, though the broad shape seems fine. My comments are mostly around naming + interface conventions, and a bit about testing and clocks.

It does seem like a test at least documenting the current behavior when a poststop task gets explicitly killed would be useful. Similarly, the current e2e tests only appear to check poststop task behavior for batch jobs. Is it our intent to only support hooking into batch jobs, or could we add some additional tests for what happens with other job types? (NMD-017 doesn't appear to present a spec or even opinion on this, so I'm open to whatever answer makes sense right now, as long as we have some documentation and tests to explain it.)

@@ -523,6 +541,7 @@ func (ar *allocRunner) handleTaskStateUpdates() {
 	// prevent looping before TaskRunners have transitioned
 	// to Dead.
 	for _, tr := range liveRunners {
+		ar.logger.Info("killing task: ", tr.Task().Name)

Should this actually happen at the info level? My expectation is that these messages will hit the console in a default config, and adding a new sort of tracing/debug message here (without putting it at the debug log level) could create log noise that most operators won't know what to do with. (A possible adjustment is sketched below.)
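One possible adjustment along these lines, assuming ar.logger is Nomad's usual go-hclog logger (whose methods take a message plus alternating key/value pairs):

ar.logger.Debug("killing task", "task", tr.Task().Name)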


ephemeralTask := alloc.Job.TaskGroups[0].Tasks[1]
ephemeralTask.Name = "quit"
ephemeralTask.Lifecycle.Hook = structs.TaskLifecycleHookPoststop
ephemeralTask.Config["run_for"] = "10s"

Are these numbers wall-clock time? (I.e., will this test take a minimum of 10 seconds to run?) It may not be a big deal in isolation, but actual blocking sleep calls scattered throughout a test suite can pile up and make tests slooooowwww.

@notnoop (Contributor) left a comment:

A couple of questions; we can follow up in sync.


// TestAllocRunner_Lifecycle_Poststop asserts that a service job with 1
// poststop lifecycle hook starts all 3 tasks, only
// the ephemeral one finishes, and the other 2 exit when the alloc is stopped.
func TestAllocRunner_Lifecycle_Poststop(t *testing.T) {
Contributor:

I would suggest changing the name of this test, since it tests stop behavior: TestAllocRunner_Lifecycle_Poststop_IfStopped?

Also, we should add a test for when main tasks complete naturally, for both batch and service jobs.
Also, we should add a test for when

@notnoop (Contributor) left a comment:

Let's merge this to be included in the beta release. We will follow up with additional tests and refactoring.

@jazzyfresh merged commit b85cce4 into master Nov 12, 2020
@jazzyfresh deleted the lifecycle-poststop-hook branch November 12, 2020
@github-actions (bot) commented:

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions (bot) locked as resolved and limited conversation to collaborators Dec 11, 2022
Merging this pull request may close the issue: Task Lifecycle PostStop Hook