Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

client: always run alloc cleanup hooks on final update #15855

Merged
merged 2 commits into from
Jan 27, 2023
Merged

Conversation

shoenig
Copy link
Member

@shoenig shoenig commented Jan 23, 2023

This PR fixes a bug where alloc pre-kill hooks were not run in the
edge case where there are no live tasks remaining, but it is also
the final update to process for the (terminal) allocation. We need
to run cleanup hooks here, otherwise they will not run until the
allocation gets garbage collected (i.e. via Destroy()), possibly
at a distant time in the future.

Fixes #15477

This PR fixes a bug where alloc pre-kill hooks were not run in the
edge case where there are no live tasks remaining, but it is also
the final update to process for the (terminal) allocation. We need
to run cleanup hooks here, otherwise they will not run until the
allocation gets garbage collected (i.e. via Destroy()), possibly
at a distant time in the future.

Fixes #15477
@shoenig shoenig changed the title client: always run alloc cleanup hooks on last pass client: always run alloc cleanup hooks on final update Jan 23, 2023
@shoenig shoenig marked this pull request as ready for review January 23, 2023 21:50
Comment on lines 591 to 597
// there are no live runners left

// run AR pre-kill hooks if this alloc is terminal; any post-stop
// tasks would regularly run in this state anyway (?)
if done {
ar.preKillHooks()
}
Copy link
Member

@schmichael schmichael Jan 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will erroneously run prekill hooks on client agent shutdown.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a guard with ar.isShuttingDown()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if preKillHooks() shouldn't be run during agent shutdown, would it make sense to put the guard inside of preKillHooks() instead of here? more broadly, if this func is a potential foot-shoot, is there a way to make it less foot-shoot-able?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question @gulducat. I'm afraid it needs to be up to the caller to decide whether prekill hooks should run or not because only the callers knows whether a kill event (eg a task dying) was triggered or not. Once your in prekill, a shutdown may be issued concurrently, but that doesn't change the fact that the prekill was triggered by a kill event. So I think this is the right approach as @shoenig's code is able to differentiate a kill event from an agent shutdown event.

That being said every operation in Nomad needs to be crash safe which is why prekill hooks are also run when garbage collecting dead allocations. So prekill should always eventually be called on every terminal allocation regardless of the triggering event.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I understand things now 😅

preKillHooks is only called by killTasks() and killTasks is called in a few places:

The part that was confusing me was why handleAllocUpdate was not triggering the preKillHooks: the allocation does eventually die and is marked with ClientStatus: failed.

The reason is that this is a client-only alloc update. When watching allocations the client ignores changes that it already knows about and so the ar.Update() method is not called when the allocation fails client-side.

We don't want to destroy the alloc runner so the only place left to call the hooks is in handleTaskStateUpdates like @shoenig did.

The second thing that was confusing to me is why we only check for killEvent if there are live runners, I feel like we should always be checking that and calling killTasks() if not nil? Something like https://github.com/hashicorp/nomad/compare/wip-luiz-kill-ar (this probably breaks other stuff since task lifecycle is finicky 😅).

All of this to say that I think we can use if killEvent != nil here to guard the preKillHooks so it matches the other conditional.

Copy link
Member

@gulducat gulducat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💀 ✅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/1.4.x backport to 1.4.x release line
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Health check routine leaks using new nomad provider
4 participants