client: always run alloc cleanup hooks on final update #15855

shoenig · 2023-01-23T19:24:16Z

This PR fixes a bug where alloc pre-kill hooks were not run in the
edge case where there are no live tasks remaining, but it is also
the final update to process for the (terminal) allocation. We need
to run cleanup hooks here, otherwise they will not run until the
allocation gets garbage collected (i.e. via Destroy()), possibly
at a distant time in the future.

Fixes #15477

schmichael · 2023-01-23T21:53:57Z

client/allocrunner/alloc_runner.go

+			// there are no live runners left
+
+			// run AR pre-kill hooks if this alloc is terminal; any post-stop
+			// tasks would regularly run in this state anyway (?)
+			if done {
+				ar.preKillHooks()
+			}


I think this will erroneously run prekill hooks on client agent shutdown.

I added a guard with ar.isShuttingDown()
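The guard discussed here can be pictured with a small, self-contained sketch. It is not Nomad's actual code: the `allocRunner` struct, its fields, and `onNoLiveRunners` are hypothetical stand-ins, and only the names `done`, `preKillHooks`, and `isShuttingDown` come from this thread.

```go
package main

import "fmt"

// allocRunner is a hypothetical stand-in for the real alloc runner; it keeps
// only the state needed to illustrate the guard.
type allocRunner struct {
	shuttingDown bool // true while the client agent is shutting down
	hooksRun     int  // number of times the pre-kill hooks were invoked
}

func (ar *allocRunner) isShuttingDown() bool { return ar.shuttingDown }

func (ar *allocRunner) preKillHooks() { ar.hooksRun++ }

// onNoLiveRunners models the branch taken when no live task runners remain.
// done reports whether the allocation is terminal.
func (ar *allocRunner) onNoLiveRunners(done bool) {
	// Run cleanup only for a terminal alloc, and never because of agent shutdown.
	if done && !ar.isShuttingDown() {
		ar.preKillHooks()
	}
}

func main() {
	terminal := &allocRunner{}
	terminal.onNoLiveRunners(true)

	shutdown := &allocRunner{shuttingDown: true}
	shutdown.onNoLiveRunners(true)

	fmt.Println("terminal alloc hooks run:", terminal.hooksRun)
	fmt.Println("agent shutdown hooks run:", shutdown.hooksRun)
}
```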

If preKillHooks() shouldn't be run during agent shutdown, would it make sense to put the guard inside preKillHooks() instead of here? More broadly, if this func is a potential foot-shoot, is there a way to make it less foot-shoot-able?

Good question @gulducat. I'm afraid it needs to be up to the caller to decide whether prekill hooks should run or not, because only the caller knows whether a kill event (e.g. a task dying) was triggered or not. Once you're in prekill, a shutdown may be issued concurrently, but that doesn't change the fact that the prekill was triggered by a kill event. So I think this is the right approach, as @shoenig's code is able to differentiate a kill event from an agent shutdown event.

That being said, every operation in Nomad needs to be crash safe, which is why prekill hooks are also run when garbage collecting dead allocations. So prekill should always eventually be called on every terminal allocation, regardless of the triggering event.
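Because prekill hooks can be reached from more than one path (a kill event, and later garbage collection), a hook is safest when repeated calls are harmless. The following is a minimal sketch under that assumption; `deregisterHook`, its fields, and its `PreKill` method are hypothetical and not taken from Nomad.

```go
package main

import (
	"fmt"
	"sync"
)

// deregisterHook is a hypothetical cleanup hook, not a Nomad type.
type deregisterHook struct {
	once     sync.Once
	cleanups int // number of times the real cleanup work ran
}

// PreKill performs the cleanup at most once per process, however many times
// it is called.
func (h *deregisterHook) PreKill() {
	h.once.Do(func() { h.cleanups++ })
}

func main() {
	h := &deregisterHook{}
	h.PreKill() // e.g. reached from a kill event on the final update
	h.PreKill() // e.g. reached again when the allocation is garbage collected
	fmt.Println("cleanups performed:", h.cleanups)
}
```

The in-memory `sync.Once` does not survive a restart, so the cleanup itself should also tolerate being repeated after a crash.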

I think I understand things now 😅

preKillHooks is only called by killTasks(), and killTasks is called in a few places:

destroyImpl, which is expected.

handleTaskStateUpdates, which also makes sense since we may need to kill all tasks if one of them fails.

handleAllocUpdate, which again makes sense: since the alloc is stopped, we need to kill the tasks.

The part that was confusing me was why handleAllocUpdate was not triggering the preKillHooks: the allocation does eventually die and is marked with ClientStatus: failed.

The reason is that this is a client-only alloc update. When watching allocations the client ignores changes that it already knows about and so the ar.Update() method is not called when the allocation fails client-side.
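One way to picture why the update never reaches the alloc runner is a watcher that forwards only allocations whose index is newer than what the client already holds. This is an illustrative sketch, not the client's real watch loop; `allocUpdate`, `ModifyIndex`, and `filterUpdates` are hypothetical names.

```go
package main

import "fmt"

// allocUpdate is a hypothetical, simplified view of an allocation as seen by
// the watcher.
type allocUpdate struct {
	ID          string
	ModifyIndex uint64
}

// filterUpdates returns only the updates that are newer than the index the
// client already knows, and records the newer index in known.
func filterUpdates(known map[string]uint64, incoming []allocUpdate) []allocUpdate {
	var fresh []allocUpdate
	for _, u := range incoming {
		if idx, ok := known[u.ID]; ok && u.ModifyIndex <= idx {
			continue // already known: the alloc runner is never notified
		}
		known[u.ID] = u.ModifyIndex
		fresh = append(fresh, u)
	}
	return fresh
}

func main() {
	known := map[string]uint64{"a1": 10}
	incoming := []allocUpdate{
		{ID: "a1", ModifyIndex: 10}, // unchanged from the client's point of view
		{ID: "a2", ModifyIndex: 3},  // new allocation
	}
	for _, u := range filterUpdates(known, incoming) {
		fmt.Println("forward to alloc runner:", u.ID)
	}
}
```

In this model, a status change that originates on the client does not produce a newer index, so it is filtered out like any other known state.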

We don't want to destroy the alloc runner so the only place left to call the hooks is in handleTaskStateUpdates like @shoenig did.

The second thing that confused me is why we only check for killEvent if there are live runners. I feel like we should always check it and call killTasks() if it is not nil. Something like https://github.com/hashicorp/nomad/compare/wip-luiz-kill-ar (this probably breaks other stuff since task lifecycle is finicky 😅).

All of this to say that I think we can use if killEvent != nil here to guard the preKillHooks so it matches the other conditional.
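The suggestion can be sketched as follows, with both branches guarded by the same `killEvent != nil` condition. This is a simplified model rather than the real `handleTaskStateUpdates`: the `taskEvent` type, the `handle` method, and the `liveRunners` count are hypothetical.

```go
package main

import "fmt"

// taskEvent is a hypothetical stand-in for the event describing why tasks
// should be killed.
type taskEvent struct{ reason string }

// allocRunner is a hypothetical stand-in that only counts hook invocations.
type allocRunner struct{ hooksRun int }

func (ar *allocRunner) preKillHooks() { ar.hooksRun++ }

// killTasks models the existing path: run the hooks, then kill live tasks
// (the kill itself is omitted here).
func (ar *allocRunner) killTasks() { ar.preKillHooks() }

// handle models one pass over the task states.
func (ar *allocRunner) handle(liveRunners int, killEvent *taskEvent) {
	if liveRunners > 0 {
		if killEvent != nil {
			ar.killTasks()
		}
		return
	}
	// No live runners left: guard the hooks with the same condition.
	if killEvent != nil {
		ar.preKillHooks()
	}
}

func main() {
	ar := &allocRunner{}
	ar.handle(0, nil)                                // nothing triggered a kill
	ar.handle(0, &taskEvent{reason: "task failed"}) // kill event, no live tasks
	fmt.Println("hooks run:", ar.hooksRun)
}
```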

gulducat

💀 ✅

shoenig force-pushed the f-nsd-check-leaks branch from b1803a6 to 27e6363 Compare January 23, 2023 19:56

shoenig force-pushed the f-nsd-check-leaks branch from 27e6363 to 6d6d5df Compare January 23, 2023 21:20

shoenig force-pushed the f-nsd-check-leaks branch from 6d6d5df to 672dc16 Compare January 23, 2023 21:22

shoenig changed the title from "client: always run alloc cleanup hooks on last pass" to "client: always run alloc cleanup hooks on final update" Jan 23, 2023

shoenig marked this pull request as ready for review January 23, 2023 21:50

shoenig requested review from lgfa29, schmichael and gulducat January 23, 2023 21:50

schmichael reviewed Jan 23, 2023

client: do not run ar cleanup hooks if client is shutting down

137ce0b

gulducat approved these changes Jan 25, 2023

shoenig merged commit d30e342 into main Jan 27, 2023

shoenig deleted the f-nsd-check-leaks branch January 27, 2023 15:59

shoenig added the backport/1.4.x label (backport to 1.4.x release line) Jan 27, 2023

hc-github-team-nomad-core mentioned this pull request Jan 27, 2023

Backport of client: always run alloc cleanup hooks on final update into release/1.4.x #15924

Merged

Review comments, in order:

schmichael Jan 23, 2023 (edited)

shoenig Jan 24, 2023

gulducat Jan 24, 2023

schmichael Jan 24, 2023

lgfa29 Jan 25, 2023

gulducat left a comment
