Task leak if alloc prerun hook fails after client restart #17102

gulducat · 2023-05-05T20:16:35Z

If a prerun hook fails while alloc state is being restored, as happens on a client agent restart, tasks don't get fully cleaned up and may leave behind orphaned resources such as a running container and network configuration (e.g. iptables rules).

This was pointed out in #13028 where specifically a CSI prerun hook fails, but it's an issue more generally with alloc runner prerun hooks.

I encountered it myself while investigating that issue, and as @ygersie put it:

"The worst thing is that Nomad garbage collects the failed allocation but doesn't actually shutdown the docker container (checked the docker logs it never received the api call to stop it either), leaving a zombie container."
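To make the failure mode concrete, here is a minimal sketch in Go. It is a toy model, not Nomad's actual allocrunner code and not the change made in #17104; the `task` and `allocRunner` types and both `run*` methods are hypothetical names invented for illustration. It contrasts an early return on prerun failure, which leaves a restored task running, with a path that kills restored tasks before returning.

```go
package main

import (
	"errors"
	"fmt"
)

// task is a hypothetical stand-in for a restored task whose container is still alive.
type task struct {
	name    string
	running bool
}

func (t *task) kill() { t.running = false }

// allocRunner is a toy model for illustration only, not Nomad's real type.
type allocRunner struct {
	tasks  []*task
	prerun func() error
}

// runLeaky models the reported behavior: a prerun failure returns early,
// so tasks restored from disk are never told to stop.
func (ar *allocRunner) runLeaky() error {
	if err := ar.prerun(); err != nil {
		return fmt.Errorf("prerun failed: %w", err)
	}
	return nil
}

// runWithCleanup models the expected behavior: restored tasks are killed
// before the failure is reported.
func (ar *allocRunner) runWithCleanup() error {
	if err := ar.prerun(); err != nil {
		for _, t := range ar.tasks {
			t.kill()
		}
		return fmt.Errorf("prerun failed: %w", err)
	}
	return nil
}

// countRunning returns how many tasks are still marked as running.
func countRunning(tasks []*task) int {
	n := 0
	for _, t := range tasks {
		if t.running {
			n++
		}
	}
	return n
}

// newRestored builds a runner with one already-running task and a failing hook,
// mimicking state restored after a client agent restart.
func newRestored() *allocRunner {
	return &allocRunner{
		tasks:  []*task{{name: "web", running: true}},
		prerun: func() error { return errors.New("hook failed") },
	}
}

func main() {
	leaky := newRestored()
	_ = leaky.runLeaky()
	fmt.Println("leaky path, tasks still running:", countRunning(leaky.tasks))

	fixed := newRestored()
	_ = fixed.runWithCleanup()
	fmt.Println("cleanup path, tasks still running:", countRunning(fixed.tasks))
}
```

In this model the leaky path leaves the restored task running even though the alloc is reported as failed, which corresponds to the zombie container described here.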

Reproduction steps

I made a strange hook that lets me poison an alloc on disk, so that the hook succeeds on the first pass but fails after a client agent stop/start.

Expected Result

All of the failed task's resources are cleaned up.

Actual Result

The task is marked as failed and dead and gets replaced, but the old container remains running.

Also, if the task uses a static port, the replacement task will fail to start because the port is still held by the "failed" task.

suikast42 · 2023-05-06T14:43:05Z

Does this maybe have some parallels with #17079?

tgross · 2023-05-08T12:45:05Z

@suikast42 it seems like you've mixed up two different issues. Can you move the drain discussion back over to the drain ticket and not this one?

gulducat · 2023-05-08T17:25:36Z

@suikast42 There are some conceptual parallels, in that both cases result in stuff getting left behind unexpectedly.

However, in #17079 your logs indicate that task "Killing" is starting, which is one of the things that currently does not happen under the specific circumstances that cause this issue. Conversely, one of the things that does happen appropriately in this case is the task stop hooks, which include service deregistration (edit: and alloc postrun hooks for deregistering group-level services).

So, good job keeping watch, but these cases are definitely unrelated!

gulducat added the type/bug label May 5, 2023

gulducat self-assigned this May 5, 2023

gulducat mentioned this issue May 6, 2023

Fix task leak during client restore when allocrunner prerun hook fails #17104

Merged

gulducat closed this as completed in #17104 May 8, 2023
