Task leak if alloc prerun hook fails after client restart #17102

Closed
gulducat opened this issue May 5, 2023 · 4 comments · Fixed by #17104

@gulducat
Member

gulducat commented May 5, 2023

If a prerun hook fails when restoring alloc state, as with a client agent restart, tasks don't get fully cleaned up and may leave orphan resources like a running container and network configuration (e.g. iptables rules).

This was pointed out in #13028 where specifically a CSI prerun hook fails, but it's an issue more generally with alloc runner prerun hooks.

I encountered it myself while investigating that issue, and as @ygersie put it,

"The worst thing is that Nomad garbage collects the failed allocation but doesn't actually shutdown the docker container (checked the docker logs it never received the api call to stop it either), leaving a zombie container."
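
To make the failure mode concrete, here is a toy model of it. This is only a sketch: `allocRunner`, `task`, `restore`, and `killOnError` are hypothetical names invented for illustration and do not mirror Nomad's real alloc runner or the actual fix. It assumes the tasks are already running when the prerun hooks are re-run after a restart.

```go
package main

import (
	"errors"
	"fmt"
)

// task is a toy stand-in for a restored task with a live container.
type task struct {
	name    string
	running bool
}

// allocRunner is a hypothetical, heavily simplified model of an alloc runner.
type allocRunner struct {
	tasks       []*task
	prerun      func() error
	killOnError bool // whether a prerun failure also stops restored tasks
}

// restore models a client restart: tasks are already running when the prerun
// hooks run again. It reports whether the alloc failed and which tasks are
// still running afterwards.
func (ar *allocRunner) restore() (failed bool, stillRunning []string) {
	if err := ar.prerun(); err != nil {
		failed = true
		if ar.killOnError {
			for _, t := range ar.tasks {
				t.running = false
			}
		}
		// Without the kill step, the alloc is marked failed while its
		// tasks keep running: that is the leak.
	}
	for _, t := range ar.tasks {
		if t.running {
			stillRunning = append(stillRunning, t.name)
		}
	}
	return failed, stillRunning
}

// newRunner builds a runner with one running task and a failing prerun hook.
func newRunner(killOnError bool) *allocRunner {
	return &allocRunner{
		tasks:       []*task{{name: "web", running: true}},
		prerun:      func() error { return errors.New("prerun hook failed") },
		killOnError: killOnError,
	}
}

func main() {
	failed, running := newRunner(false).restore()
	fmt.Println("without kill:", failed, running)

	failed, running = newRunner(true).restore()
	fmt.Println("with kill:", failed, running)
}
```

In the first case the alloc is failed but the task is still listed as running; in the second, nothing is left behind.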

Reproduction steps

I made a strange hook that lets me poison an alloc on disk, so that it succeeds on the first pass but fails after the client agent is stopped and started again.

Expected Result

All of the failed task's resources are cleaned up.

Actual Result

The task is marked as failed and dead and gets replaced, but the old container remains running.

Also, if the task uses a static port, the replacement task will fail to start because the port is still held by the "failed" task.
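
The static port symptom is ordinary socket behavior rather than anything Nomad-specific. As a small illustration (the helper name `tryRebind` and the loopback address are arbitrary choices for this sketch), a second listener on an address that is still held by a live listener is refused:

```go
package main

import (
	"fmt"
	"net"
)

// tryRebind opens a listener on a free loopback port (standing in for the
// leaked task), then tries to listen on the same address again (standing in
// for the replacement task). It returns the error from the second attempt.
func tryRebind() (conflict error, setupErr error) {
	held, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, err
	}
	defer held.Close()

	second, err := net.Listen("tcp", held.Addr().String())
	if err != nil {
		return err, nil // the port is still held by the first listener
	}
	second.Close()
	return nil, nil
}

func main() {
	conflict, err := tryRebind()
	if err != nil {
		fmt.Println("could not set up the first listener:", err)
		return
	}
	fmt.Println("second bind error:", conflict)
}
```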

@suikast42
Contributor

Does this maybe have some parallels with #17079?

@suikast42

This comment was marked as off-topic.

@tgross
Member

tgross commented May 8, 2023

@suikast42 it seems like you've mixed up two different issues. Can you move the drain discussion back over to the drain ticket and not this one?

@gulducat
Member Author

gulducat commented May 8, 2023

@suikast42 There are some conceptual parallels, in that both cases result in stuff getting left behind unexpectedly.

However, in #17079 your logs indicate that task "Killing" is starting, which is one of the things that currently does not happen under the specific circumstances that cause this issue. Conversely, one thing that does happen appropriately in this case is that task stop hooks run, which includes service deregistration (edit: and alloc postrun hooks for deregistering group-level services).

So good keeping watch, but these cases are definitely unrelated!
