Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix task leak during client restore when allocrunner prerun hook fails #17104

Merged
merged 1 commit into from
May 8, 2023

Conversation

gulducat
Copy link
Member

@gulducat gulducat commented May 6, 2023

Fixes #17102 -- I describe the issue more fully there.

My approach here is to stop skipping over allocRunner.runTasks() on prerun error. That way, instead of duplicating more cleanup code, which may change in the future, the same TaskRunner.Run() code that usually handles task cleanup can do what it needs to as appropriate with tasks that fail prerun during the alloc restore process.

In pursuit of that, I made an error-inducing FailHook and added the ability to include it as part of client Config for the client integration test. I could remove the non-Prerun interface implementations, but I figured while I'm at it, may as well make a thing that can be induced to fail at any stage in case it's useful?

@gulducat gulducat requested review from shoenig and tgross May 6, 2023 02:58
Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall. We need a changelog entry via make cl, but also my comment about the commit description applies generally. If we squash-merge this as we typically do, we'll end up with a PR description that's just a reference somewhere else, instead of being self-contained within the commit message.

client/allocrunner/taskrunner/task_runner.go Show resolved Hide resolved
"github.com/hashicorp/nomad/client/allocrunner/interfaces"
)

var ErrFailHookError = errors.New("failed successfully")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😆

Comment on lines +715 to +719
foundMessages := make(map[string]bool)
for _, event := range state.Events {
foundMessages[event.DisplayMessage] = true
}
test.True(t, foundMessages[reason], test.Sprintf("expected '%s' in events: %#v", reason, foundMessages))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Map contains value that meets this condition" seems like it'd be nice new frequently-used assertion for @shoenig's test library. But out of scope for this PR.

to avoid leaking task resources (e.g. containers,
iptables) if allocRunner prerun fails during
restore on client restart.

now if prerun fails, TaskRunner.MarkFailedKill()
will only emit an event, mark the task as failed,
and cancel the tr's killCtx, so then ar.runTasks()
-> tr.Run() can take care of the actual cleanup.

removed from (formerly) tr.MarkFailedDead(),
now handled by tr.Run():
 * set task state as dead
 * save task runner local state
 * task stop hooks

also done in tr.Run() now that it's not skipped:
 * handleKill() to kill tasks while respecting
   their shutdown delay, and retrying as needed
   * also includes task preKill hooks
 * clearDriverHandle() to destroy the task
   and associated resources
 * task exited hooks
@gulducat gulducat force-pushed the b-restore-prerun-error-task-leak branch from ad89b8f to ae7016d Compare May 8, 2023 15:16
@gulducat
Copy link
Member Author

gulducat commented May 8, 2023

Thanks! I added the changelog, and pre-squashed my commits with a more verbose message. How's it look?

Generally I do heavily edit the squash commit message like that when I go to merge, but it occurs to me that you couldn't know that, and can't pre-review what I may fill in there before merge 😋

Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/1.3.x backport to 1.3.x release line backport/1.4.x backport to 1.4.x release line backport/1.5.x backport to 1.5.x release line type/bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Task leak if alloc prerun hook fails after client restart
2 participants