client: stop some taskrunner hooks when task exits #15436

Open · wants to merge 2 commits into main

Conversation

@lgfa29 (Contributor) commented Nov 30, 2022

Non-sidecar prestart and poststart tasks run to completion and exit, but their Stop hooks are not called until the job is stopped. This leaves some background operations running even when the task itself is no longer running.

Implementing the Exited hook stops these background processes on task exit. If the task is restarted, its Prestart and Poststart hooks will run again and resume the background operations.

I skipped the Vault hook because token renewal has been quite finicky, and trying to pause it could cause even more problems. Since it's just a goroutine that calls the Vault API infrequently, it doesn't consume many resources.

Closes #15419
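
For illustration, here's a minimal, self-contained sketch of the pattern this change relies on. The hook type and method signatures below are simplified stand-ins, not Nomad's actual taskrunner hook interfaces: a background loop is started in Poststart, cancelled in Exited when the task process exits, and left ready to start again if the task restarts.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// renewalHook is a stand-in for a task runner hook that keeps a background
// goroutine running while the task runs.
type renewalHook struct {
	cancel context.CancelFunc
}

// Poststart starts the background loop. If the task restarts, the task
// runner calls Poststart again and the loop resumes.
func (h *renewalHook) Poststart(ctx context.Context) error {
	loopCtx, cancel := context.WithCancel(context.Background())
	h.cancel = cancel
	go func() {
		for {
			select {
			case <-loopCtx.Done():
				return
			case <-time.After(time.Second):
				fmt.Println("background work while the task is running")
			}
		}
	}()
	return nil
}

// Exited stops the background loop as soon as the task process exits,
// without tearing down anything needed for a later restart.
func (h *renewalHook) Exited(ctx context.Context) error {
	if h.cancel != nil {
		h.cancel()
		h.cancel = nil
	}
	return nil
}

// Stop is the final teardown when the job itself is stopped.
func (h *renewalHook) Stop(ctx context.Context) error {
	return h.Exited(ctx)
}

func main() {
	h := &renewalHook{}
	_ = h.Poststart(context.Background()) // task started
	time.Sleep(2500 * time.Millisecond)   // task runs for a while
	_ = h.Exited(context.Background())    // task exited: loop stops
	time.Sleep(time.Second)               // no more background work
}
```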

// TestTaskRunner_LogmonHook_StartRestartStop asserts that a new logmon is
// created the first time Prestart is called, reattached to on subsequent
// restarts, and killed on Stop.
func TestTaskRunner_LogmonHook_StartStop(t *testing.T) {
Member

// a new logmon is created the first time Prestart is called

To me, "first time" implies that a single logmon process should be reused and reconnected to between restarts. Mind rewriting this comment to make sure it's unambiguous?

TestTaskRunner_LogmonHook_StartRestartStop asserts that logmon's lifecycle matches the task's: it is killed when the task exits and started (or restarted or reattached) whenever the task starts.

Read on for context if you're bored 😅

I think logmon was written such that the process lived for the lifetime of the *alloc,* not the lifetime of an individual task invocation.

Most hooks that are implemented that way do so by setting response.Done = true somewhere in Prestart so that TaskRunner never bothers calling it again. However, logmon does not, since it needs to reconnect to the running logmon process on task restart, or restart logmon if the entire host was restarted and logmon no longer exists...

...and because we have to handle that "restart logmon if it did die" case, I think just killing it every time will work fine functionally!
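
To make the two patterns described above concrete, here's a rough sketch; the request/response types are simplified stand-ins for the ones in Nomad's taskrunner interfaces package and differ in detail:

```go
package main

import "fmt"

// PrestartRequest carries whatever State the hook returned on its last run.
type PrestartRequest struct {
	PreviousState map[string]string
}

// PrestartResponse lets a hook mark itself Done and persist State.
type PrestartResponse struct {
	// Done tells the task runner not to call Prestart again for this task.
	Done bool
	// State is persisted and handed back as PreviousState next time.
	State map[string]string
}

// runOnceHook shows the common pattern: do the work, then set Done so the
// task runner never calls Prestart again.
type runOnceHook struct{}

func (h *runOnceHook) Prestart(req *PrestartRequest, resp *PrestartResponse) error {
	fmt.Println("one-time setup")
	resp.Done = true
	return nil
}

// logmonStyleHook leaves Done unset so Prestart runs on every (re)start and
// can reattach to a still-running helper process via persisted state, or
// start a fresh one if that process is gone (e.g. after a host reboot).
type logmonStyleHook struct{}

func (h *logmonStyleHook) Prestart(req *PrestartRequest, resp *PrestartResponse) error {
	if cfg, ok := req.PreviousState["reattach_config"]; ok {
		fmt.Println("reattaching using", cfg)
		resp.State = req.PreviousState
		return nil
	}
	fmt.Println("starting new helper process")
	resp.State = map[string]string{"reattach_config": "pid:1234"}
	return nil
}

func main() {
	run := &runOnceHook{}
	log := &logmonStyleHook{}

	r1, r2 := &PrestartResponse{}, &PrestartResponse{}
	_ = run.Prestart(&PrestartRequest{}, r1)
	_ = log.Prestart(&PrestartRequest{}, r2)

	// On task restart the runner skips run (r1.Done == true) but calls log
	// again, passing back r2.State as PreviousState so it can reattach.
	_ = log.Prestart(&PrestartRequest{PreviousState: r2.State}, &PrestartResponse{})
}
```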

@@ -94,6 +94,17 @@ func TestTaskRunner_LogmonHook_StartStop(t *testing.T) {
origHookData = resp.State[logmonReattachKey]
require.Equal(t, origHookData, req.PreviousState[logmonReattachKey])

// Runnig exited should shutdown logmon
Member

Suggested change
// Runnig exited should shutdown logmon
// Running exited should shutdown logmon

Comment on lines +161 to +165
// Cancel all running scripts, but don't close the shutdownCh since the
// task may still be restarted.
for _, script := range h.runningScripts {
script.cancel()
}
Member

This will cause script checks with check.interval > restart.delay to be run more frequently. I can't imagine why that would be a bad thing since it just means we're heartbeating the TTL check more often, but I thought I'd mention it in case someone can dream up a way it could cause problems.
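
For context, here's a generic sketch of the two-level teardown the snippet above is part of (names and structure are illustrative, not the actual script-check hook): per-script contexts are cancelled when the task exits, while the shutdown channel only closes on a permanent stop.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// scriptHook runs periodic script checks tied to two lifetimes: each script's
// own context (task lifetime) and shutdownCh (hook/alloc lifetime).
type scriptHook struct {
	shutdownCh     chan struct{}
	runningCancels []context.CancelFunc
}

// start launches a script check loop.
func (h *scriptHook) start(name string, interval time.Duration) {
	ctx, cancel := context.WithCancel(context.Background())
	h.runningCancels = append(h.runningCancels, cancel)
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done(): // task exited; may be started again on restart
				return
			case <-h.shutdownCh: // hook stopped for good
				return
			case <-ticker.C:
				fmt.Println("running script check:", name)
			}
		}
	}()
}

// exited cancels the running scripts but leaves shutdownCh open, so a task
// restart can simply call start again.
func (h *scriptHook) exited() {
	for _, cancel := range h.runningCancels {
		cancel()
	}
	h.runningCancels = nil
}

// stop is the final teardown when the hook is shut down permanently.
func (h *scriptHook) stop() { close(h.shutdownCh) }

func main() {
	h := &scriptHook{shutdownCh: make(chan struct{})}
	h.start("healthcheck", 500*time.Millisecond)
	time.Sleep(1200 * time.Millisecond)
	h.exited() // task exited: checks stop, hook stays restartable
	h.start("healthcheck", 500*time.Millisecond)
	time.Sleep(1200 * time.Millisecond)
	h.stop() // permanent shutdown
	time.Sleep(100 * time.Millisecond)
}
```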

Comment on lines +172 to +175
h.templateManager.Stop()

// Set templateManager to nil so it's recreated by Prestart on restart.
h.templateManager = nil
Member

This will block the task from restarting until the template has been re-rendered.

I'm a little worried this could cause correlated failures if something like this happens:

  • A template dependency is unreachable
  • Task dies for an unrelated reason

Prior to this change the task would restart after restart.delay, use whatever templated files it used last, and get back to work.

After this change the task would be blocked from restarting until all dependencies were reachable.

To put it another way: prior to this change client agents could handle local "scheduling" decisions while the servers were in charge of cluster scheduling decisions. (I think that's broadly true? Artifacts and Vault for example both allow disconnected local operation as long as they were able to successfully run once.)

I think we need to maintain that behavior although I'd love to be convinced otherwise! 😅

A quick peek at TaskTemplateManager makes me think adding the ability to Pause/Resume it would be a significant effort. Adding a new parameter or other bit of plumbing to make TaskTemplateManager.Start()'s blocking on handleFirstRender optional might be an easier route.
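
As a rough illustration of that plumbing idea (this is a hypothetical API, not TaskTemplateManager's real one): the manager always renders in the background, but only blocks the caller on the first render when asked to, so a restart can proceed with previously rendered files even if a template dependency is unreachable.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// templateManager is a hypothetical stand-in for a template renderer whose
// Start can optionally block until the first render completes.
type templateManager struct {
	firstRenderCh chan struct{}
}

func newTemplateManager() *templateManager {
	return &templateManager{firstRenderCh: make(chan struct{})}
}

// Start begins the render loop. When waitForFirstRender is false, the caller
// is not blocked, so a task restart can proceed with the last rendered files.
func (m *templateManager) Start(ctx context.Context, waitForFirstRender bool) error {
	go m.run(ctx)
	if !waitForFirstRender {
		return nil
	}
	select {
	case <-m.firstRenderCh:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// run stands in for watching dependencies and writing templates to disk.
func (m *templateManager) run(ctx context.Context) {
	time.Sleep(200 * time.Millisecond)
	fmt.Println("templates rendered")
	close(m.firstRenderCh)
	<-ctx.Done()
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// First start: block until templates exist on disk.
	_ = newTemplateManager().Start(ctx, true)

	// Restart after the task exited: don't block; reuse what's on disk.
	_ = newTemplateManager().Start(ctx, false)
	fmt.Println("task restarted without waiting for a re-render")
	time.Sleep(300 * time.Millisecond)
}
```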

Member

A potentially interesting case to consider:

  1. Task A happily talks to Service X
  2. Service X crashes or is otherwise partitioned from A
  3. Task A crashes due to Service X going away
  4. Task A is down for restart.delay duration
  5. Service Y is registered as a replacement for Service X
  6. Task A restarts and happily talks to Service Y

I think the behavior of Task A in this situation is:

                  During restart.delay             After restart.delay
Before change     Immediate restart                 Crash loop until Y is discovered
After change      Blocked until Y is discovered     Blocked until Y is discovered

I think in the happy cases of Service Y being ~immediately available, your changes are optimal!

However there are some convoluted situations in which I think it could make an outage worse:

If the only way to discover Service Y was being rescheduled, we're better off with the existing crash-loop behavior, as that will eventually result in a rescheduling of the whole alloc. That's a pretty complex outage situation, I think: Nomad would have fine connectivity, but Task A's DC would lose connectivity to Consul and need rescheduling in a new DC? I don't know... in that complex a failure scenario there may be a number of other failure conditions I'm not thinking of that alter the behavior of Task A.

@tgross tgross removed backport/1.3.x backport to 1.3.x release line backport/1.4.x backport to 1.4.x release line labels Feb 9, 2024
@tgross tgross added the stage/needs-rebase This PR needs to be rebased on main before it can be backported to pick up new BPA workflows label May 17, 2024
Development

Successfully merging this pull request may close these issues.

Prestart / Init task continues to render templates
3 participants