client: stop some taskrunner hooks when task exits #15436

Open · wants to merge 2 commits into main

Conversation

@lgfa29 (Contributor) commented Nov 30, 2022

Non-sidecar prestart and poststart tasks run to completion and exit, but their Stop hooks are not called until the job is stopped. This leaves some background operations running even when the task itself is no longer running.

Implementing the Exited hook stops these background processes on task exit. If the task is restarted, its Prestart and Poststart hooks will run again and resume the background operations.

I skipped the Vault hook because token renewal has been quite finicky, and trying to pause it could cause even more problems. Since it's just a goroutine that calls the Vault API infrequently, it doesn't consume many resources.

Closes #15419
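
For illustration, here's a minimal, self-contained sketch of the pattern this change relies on. The hook type and method signatures below are simplified stand-ins, not Nomad's actual taskrunner hook interfaces: a background loop is started in Poststart, cancelled in Exited when the task process exits, and left ready to start again if the task restarts.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// renewalHook is a stand-in for a task runner hook that keeps a background
// goroutine running while the task runs.
type renewalHook struct {
	cancel context.CancelFunc
}

// Poststart starts the background loop. If the task restarts, the task
// runner calls Poststart again and the loop resumes.
func (h *renewalHook) Poststart(ctx context.Context) error {
	loopCtx, cancel := context.WithCancel(context.Background())
	h.cancel = cancel
	go func() {
		for {
			select {
			case <-loopCtx.Done():
				return
			case <-time.After(time.Second):
				fmt.Println("background work while the task is running")
			}
		}
	}()
	return nil
}

// Exited stops the background loop as soon as the task process exits,
// without tearing down anything needed for a later restart.
func (h *renewalHook) Exited(ctx context.Context) error {
	if h.cancel != nil {
		h.cancel()
		h.cancel = nil
	}
	return nil
}

// Stop is the final teardown when the job itself is stopped.
func (h *renewalHook) Stop(ctx context.Context) error {
	return h.Exited(ctx)
}

func main() {
	h := &renewalHook{}
	_ = h.Poststart(context.Background()) // task started
	time.Sleep(2500 * time.Millisecond)   // task runs for a while
	_ = h.Exited(context.Background())    // task exited: loop stops
	time.Sleep(time.Second)               // no more background work
}
```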

// TestTaskRunner_LogmonHook_StartRestartStop asserts that a new logmon is
// created the first time Prestart is called, reattached to on subsequent
// restarts, and killed on Stop.
func TestTaskRunner_LogmonHook_StartStop(t *testing.T) {
Member

// a new logmon is created the first time Prestart is called

To me, "first time" implies that a single logmon process should be reused and reconnected to between restarts. Mind rewriting this comment to make sure it's unambiguous?

TestTaskRunner_LogmonHook_StartRestartStop asserts that logmon's lifecycle matches the task's: it is killed when the task exits and started (or restarted or reattached) whenever the task starts.

Read on for context if you're bored 😅

I think logmon was written such that the process lived for the lifetime of the *alloc,* not the lifetime of an individual task invocation.

Most hooks that are implemented that way do so by setting response.Done = true somewhere in Prestart so that TaskRunner never bothers calling it again. However, logmon does not, since it needs to reconnect to the running logmon process on task restart, or restart logmon if the entire host was restarted and logmon no longer exists...

...and because we have to handle that "restart logmon if it did die" case, I think just killing it every time will work fine functionally!
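
To make the two patterns described above concrete, here's a rough sketch; the request/response types are simplified stand-ins for the ones in Nomad's taskrunner interfaces package and differ in detail:

```go
package main

import "fmt"

// PrestartRequest carries whatever State the hook returned on its last run.
type PrestartRequest struct {
	PreviousState map[string]string
}

// PrestartResponse lets a hook mark itself Done and persist State.
type PrestartResponse struct {
	// Done tells the task runner not to call Prestart again for this task.
	Done bool
	// State is persisted and handed back as PreviousState next time.
	State map[string]string
}

// runOnceHook shows the common pattern: do the work, then set Done so the
// task runner never calls Prestart again.
type runOnceHook struct{}

func (h *runOnceHook) Prestart(req *PrestartRequest, resp *PrestartResponse) error {
	fmt.Println("one-time setup")
	resp.Done = true
	return nil
}

// logmonStyleHook leaves Done unset so Prestart runs on every (re)start and
// can reattach to a still-running helper process via persisted state, or
// start a fresh one if that process is gone (e.g. after a host reboot).
type logmonStyleHook struct{}

func (h *logmonStyleHook) Prestart(req *PrestartRequest, resp *PrestartResponse) error {
	if cfg, ok := req.PreviousState["reattach_config"]; ok {
		fmt.Println("reattaching using", cfg)
		resp.State = req.PreviousState
		return nil
	}
	fmt.Println("starting new helper process")
	resp.State = map[string]string{"reattach_config": "pid:1234"}
	return nil
}

func main() {
	run := &runOnceHook{}
	log := &logmonStyleHook{}

	r1, r2 := &PrestartResponse{}, &PrestartResponse{}
	_ = run.Prestart(&PrestartRequest{}, r1)
	_ = log.Prestart(&PrestartRequest{}, r2)

	// On task restart the runner skips run (r1.Done == true) but calls log
	// again, passing back r2.State as PreviousState so it can reattach.
	_ = log.Prestart(&PrestartRequest{PreviousState: r2.State}, &PrestartResponse{})
}
```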

@@ -94,6 +94,17 @@ func TestTaskRunner_LogmonHook_StartStop(t *testing.T) {
origHookData = resp.State[logmonReattachKey]
require.Equal(t, origHookData, req.PreviousState[logmonReattachKey])

// Runnig exited should shutdown logmon
Member

Suggested change
// Runnig exited should shutdown logmon
// Running exited should shutdown logmon

Comment on lines +161 to +165
// Cancel all running scripts, but don't close the shutdownCh since the
// task may still be restarted.
for _, script := range h.runningScripts {
script.cancel()
}
Member

This will cause script checks with check.interval > restart.delay to be run more frequently. I can't imagine why that would be a bad thing since it just means we're heartbeating the TTL check more often, but I thought I'd mention it in case someone can dream up a way it could cause problems.
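
For context, here's a generic sketch of the two-level teardown the snippet above is part of (names and structure are illustrative, not the actual script-check hook): per-script contexts are cancelled when the task exits, while the shutdown channel only closes on a permanent stop.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// scriptHook runs periodic script checks tied to two lifetimes: each script's
// own context (task lifetime) and shutdownCh (hook/alloc lifetime).
type scriptHook struct {
	shutdownCh     chan struct{}
	runningCancels []context.CancelFunc
}

// start launches a script check loop.
func (h *scriptHook) start(name string, interval time.Duration) {
	ctx, cancel := context.WithCancel(context.Background())
	h.runningCancels = append(h.runningCancels, cancel)
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done(): // task exited; may be started again on restart
				return
			case <-h.shutdownCh: // hook stopped for good
				return
			case <-ticker.C:
				fmt.Println("running script check:", name)
			}
		}
	}()
}

// exited cancels the running scripts but leaves shutdownCh open, so a task
// restart can simply call start again.
func (h *scriptHook) exited() {
	for _, cancel := range h.runningCancels {
		cancel()
	}
	h.runningCancels = nil
}

// stop is the final teardown when the hook is shut down permanently.
func (h *scriptHook) stop() { close(h.shutdownCh) }

func main() {
	h := &scriptHook{shutdownCh: make(chan struct{})}
	h.start("healthcheck", 500*time.Millisecond)
	time.Sleep(1200 * time.Millisecond)
	h.exited() // task exited: checks stop, hook stays restartable
	h.start("healthcheck", 500*time.Millisecond)
	time.Sleep(1200 * time.Millisecond)
	h.stop() // permanent shutdown
	time.Sleep(100 * time.Millisecond)
}
```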

Comment on lines +172 to +175
h.templateManager.Stop()

// Set templateManager to nil so it's recreated by Prestart on restart.
h.templateManager = nil
Member

This will block the task from restarting until the template has been re-rendered.

I'm a little worried this could cause correlated failures if something like this happens:

  • A template dependency is unreachable
  • Task dies for an unrelated reason

Prior to this change the task would restart after restart.delay, use whatever templated files it used last, and get back to work.

After this change the task would be blocked from restarting until all dependencies were reachable.

To put it another way: prior to this change client agents could handle local "scheduling" decisions while the servers were in charge of cluster scheduling decisions. (I think that's broadly true? Artifacts and Vault for example both allow disconnected local operation as long as they were able to successfully run once.)

I think we need to maintain that behavior although I'd love to be convinced otherwise! 😅

A quick peek at TaskTemplateManager makes me think adding the ability to Pause/Resume it would be a significant effort. Adding a new parameter or other bit of plumbing to make TaskTemplateManager.Start()'s blocking on handleFirstRender optional might be an easier route.
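
As a rough illustration of that plumbing idea (this is a hypothetical API, not TaskTemplateManager's real one): the manager always renders in the background, but only blocks the caller on the first render when asked to, so a restart can proceed with previously rendered files even if a template dependency is unreachable.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// templateManager is a hypothetical stand-in for a template renderer whose
// Start can optionally block until the first render completes.
type templateManager struct {
	firstRenderCh chan struct{}
}

func newTemplateManager() *templateManager {
	return &templateManager{firstRenderCh: make(chan struct{})}
}

// Start begins the render loop. When waitForFirstRender is false, the caller
// is not blocked, so a task restart can proceed with the last rendered files.
func (m *templateManager) Start(ctx context.Context, waitForFirstRender bool) error {
	go m.run(ctx)
	if !waitForFirstRender {
		return nil
	}
	select {
	case <-m.firstRenderCh:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// run stands in for watching dependencies and writing templates to disk.
func (m *templateManager) run(ctx context.Context) {
	time.Sleep(200 * time.Millisecond)
	fmt.Println("templates rendered")
	close(m.firstRenderCh)
	<-ctx.Done()
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// First start: block until templates exist on disk.
	_ = newTemplateManager().Start(ctx, true)

	// Restart after the task exited: don't block; reuse what's on disk.
	_ = newTemplateManager().Start(ctx, false)
	fmt.Println("task restarted without waiting for a re-render")
	time.Sleep(300 * time.Millisecond)
}
```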

Member

A potentially interesting case to consider:

  1. Task A happily talks to Service X
  2. Service X crashes or is otherwise partitioned from A
  3. Task A crashes due to Service X going away
  4. Task A is down for restart.delay duration
  5. Service Y is registered as a replacement for Service X
  6. Task A restarts and happily talks to Service Y

I think the behavior of Task A in this situation is:

                  During restart.delay             After restart.delay
Before change     Immediate restart                 Crash loop until Y is discovered
After change      Blocked until Y is discovered     Blocked until Y is discovered

I think in the happy cases of Service Y being ~immediately available, your changes are optimal!

However there are some convoluted situations in which I think it could make an outage worse:

If the only way to discover Service Y was being rescheduled, we're better off with the existing crash-loop behavior, as that will eventually result in a rescheduling of the whole alloc. That's a pretty complex outage situation, I think: Nomad would have fine connectivity, but Task A's DC would lose connectivity to Consul and need rescheduling in a new DC? I don't know... in that complex a failure scenario there may be a number of other failure conditions I'm not thinking of that alter the behavior of Task A.

@tgross tgross removed backport/1.3.x backport to 1.3.x release line backport/1.4.x backport to 1.4.x release line labels Feb 9, 2024
@tgross tgross added the stage/needs-rebase This PR needs to be rebased on main before it can be backported to pick up new BPA workflows label May 17, 2024
Development

Successfully merging this pull request may close these issues.

Prestart / Init task continues to render templates
3 participants