
gracefully recover tasks that use csi node plugins #16809

Merged: gulducat merged 4 commits into main from fix-13028-csi-volumes-on-client-start on Apr 10, 2023

Conversation

@gulducat (Member) commented Apr 5, 2023

When a client restarts, happily running tasks that use a CSI volume may fail to be recovered and instead be killed and rescheduled, even though a healthy CSI plugin task is running right next to them.

The simplest reproduction of the error condition I could manage uses this repo's demo/csi/hostpath example. With the tasks all set up, I stop the client and start it again:

2023-04-05T22:10:33.940Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=0480ac51-dd38-49ec-8831-48fd3c674279 task=redis type=Received msg="Task received by client" failed=false
2023-04-05T22:10:33.948Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=d013ee08-10fc-ff0e-ef5e-cc444048fbfd task=plugin type=Received msg="Task received by client" failed=false
2023-04-05T22:10:33.959Z [ERROR] client.alloc_runner: prerun failed: alloc_id=0480ac51-dd38-49ec-8831-48fd3c674279 error="pre-run hook \"csi_hook\" failed: plugin hostpath-plugin0 for type csi-node not found"
2023-04-05T22:10:33.964Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=0480ac51-dd38-49ec-8831-48fd3c674279 task=redis type="Setup Failure" msg="failed to setup alloc: pre-run hook \"csi_hook\" failed: plugin hostpath-plugin0 for type csi-node not found" failed=true
2023-04-05T22:10:33.974Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=d013ee08-10fc-ff0e-ef5e-cc444048fbfd task=plugin type="Plugin became healthy" msg="plugin: hostpath-plugin0" failed=false
2023-04-05T22:10:36.973Z [INFO]  client.gc: marking allocation for GC: alloc_id=0480ac51-dd38-49ec-8831-48fd3c674279

The client doesn't yet know that the CSI plugin task is healthy, because it happens to try to recover the task that needs the plugin first.

After this change, which retries for some time (one minute; is that ok?) to find the plugin that it expects to exist (a rough sketch of the retry idea follows the log output below):

2023-04-05T22:18:03.805Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=3f38e8b5-106b-87c7-9df4-b5815fa955d5 task=redis type=Received msg="Task received by client" failed=false
2023-04-05T22:18:03.819Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=d6f262f3-b20d-89d2-1bdd-6f92a01c2003 task=plugin type=Received msg="Task received by client" failed=false
2023-04-05T22:18:04.116Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=d6f262f3-b20d-89d2-1bdd-6f92a01c2003 task=plugin type="Plugin became healthy" msg="plugin: hostpath-plugin0" failed=false
2023-04-05T22:18:04.262Z [INFO]  client.alloc_runner.runner_hook.csi_hook: found CSI plugin: alloc_id=3f38e8b5-106b-87c7-9df4-b5815fa955d5 type=csi-node name=hostpath-plugin0
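
For illustration only, here is a rough sketch of the retry idea (hypothetical names, not the actual diff): instead of failing the csi_hook prerun immediately when the node plugin isn't registered yet, poll the registry until the plugin shows up or a deadline passes. The lookup function, the one-minute bound, and the poll interval are stand-ins for whatever the registry and csimanager actually expose.

package sketch

import (
	"context"
	"fmt"
	"time"
)

// PluginInfo is a placeholder for whatever the plugin registry returns.
type PluginInfo struct {
	Name string
}

// waitForNodePlugin polls until the named CSI node plugin is registered or
// the deadline expires. lookup stands in for the registry call that, before
// this change, failed immediately while the plugin alloc was still restoring.
func waitForNodePlugin(ctx context.Context, pluginID string,
	lookup func(id string) (*PluginInfo, bool)) (*PluginInfo, error) {

	// bound the wait so a genuinely missing plugin still fails the alloc
	ctx, cancel := context.WithTimeout(ctx, time.Minute)
	defer cancel()

	ticker := time.NewTicker(2 * time.Second)
	defer ticker.Stop()

	for {
		if p, ok := lookup(pluginID); ok {
			return p, nil // plugin task recovered and registered itself
		}
		select {
		case <-ctx.Done():
			return nil, fmt.Errorf("plugin %s for type csi-node not found: %w", pluginID, ctx.Err())
		case <-ticker.C:
			// try again; the plugin alloc may still be restoring
		}
	}
}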

We may wish to test with a more complex CSI setup to make sure that this fully Fixes #13028

so that on startup, clients can recover running tasks that use CSI volumes, instead of them being terminated and rescheduled because they need a volume that "doesn't exist" yet, only because the plugin task has not yet been recovered.
@gulducat gulducat requested a review from tgross April 5, 2023 22:43
@tgross (Member) left a comment

This is looking really good, @gulducat. I've left a few comments around some of the implementation details that we'll want to fix up.

client/dynamicplugins/registry.go (resolved review thread)
client/pluginmanager/csimanager/manager.go (resolved review thread)
client/pluginmanager/csimanager/manager.go (resolved review thread)
client/dynamicplugins/registry.go (resolved review thread)
// not specific RPC timeouts, but we manage the stream
// lifetime via Close in the pluginmanager.
mounter, err := c.csimanager.MounterForPlugin(c.shutdownCtx, pair.volume.PluginID)
// make sure the plugin is ready or becomes so quickly.

tgross (Member)

I think this comment is misleading, because we're actually waiting up to 24hr?

Having the really long timeout on the unpublish workflow makes sense because we need to be able to recover the already-mounted volume from that state. But on the publish workflow we probably want to abandon the attempt within a short-ish amount of time so that we're not delaying a reschedule too much in the case where we've placed a new alloc and it's never going to work.

Maybe... 5min to allow for a very busy restoring client?

(Ideally we'd be able to distinguish between a "this is the first time" vs a "this is a restore", but we can do that and/or make the timeout tunable later.)

gulducat (Member Author)

Up to 24 hours is the last catch-all timeout on the registry end, just in case callers don't set their own timeouts like they should. The csiManager's WaitForPlugin() adds a timeout of one minute because it has (or can have) more specific domain knowledge of its plugins' behavior.

I went back and forth on where to put which contexts with which timeout durations, and these values worked okay in my little sandbox. Should I bump csiManager's up to 5min, or move the timeouts around some other way?
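
To make the layering concrete, here is a minimal sketch of the two nested timeouts being described (illustrative names, not the real API): the registry applies a very long catch-all deadline as a safety net for careless callers, while the csimanager-level wait applies its own one-minute deadline, so the effective bound is whichever is shorter.

package sketch

import (
	"context"
	"time"
)

// registryCatchAll mimics the registry-side safety net: even if a caller
// forgot to set a deadline, the wait is capped at 24 hours.
func registryCatchAll(ctx context.Context) (context.Context, context.CancelFunc) {
	return context.WithTimeout(ctx, 24*time.Hour)
}

// managerWait mimics the csimanager-level wait, which knows more about its
// plugins' behavior and adds its own shorter bound before calling into the
// registry. Nesting WithTimeout can only shrink the deadline, so the
// one-minute bound is the one that takes effect here.
func managerWait(ctx context.Context, waitInRegistry func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, time.Minute)
	defer cancel()

	ctx, cancel = registryCatchAll(ctx)
	defer cancel()

	return waitInRegistry(ctx)
}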

tgross (Member)

Ah, ok, I missed the 1 min on csimanager.WaitForPlugin. If we end up refactoring the postrun hook to use csimanager.WaitForPlugin too, we might find we want to lift that timeout configuration into the caller (the prerun or postrun hook), but this should be fine as-is for now.
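
Sketching the refactor being suggested here (hypothetical names and signature; the real WaitForPlugin may differ): each hook would own its timeout and pass an already-bounded context to a shared wait helper, so the publish path can fail fast while the unpublish path keeps a much longer grace period, without the manager hard-coding either value.

package sketch

import (
	"context"
	"time"
)

// pluginWaiter stands in for csimanager; this WaitForPlugin signature is a
// guess for illustration, not the real interface.
type pluginWaiter interface {
	WaitForPlugin(ctx context.Context, pluginType, pluginID string) error
}

// Hypothetical caller-owned bounds if the timeout moved out of csimanager:
// the prerun (publish) path fails fast so a reschedule isn't delayed, while
// the postrun (unpublish) path waits much longer so an already-mounted
// volume can still be cleaned up.
const (
	prerunWaitTimeout  = 1 * time.Minute
	postrunWaitTimeout = 24 * time.Hour
)

func prerunWait(ctx context.Context, mgr pluginWaiter, pluginID string) error {
	ctx, cancel := context.WithTimeout(ctx, prerunWaitTimeout)
	defer cancel()
	return mgr.WaitForPlugin(ctx, "csi-node", pluginID)
}

func postrunWait(ctx context.Context, mgr pluginWaiter, pluginID string) error {
	ctx, cancel := context.WithTimeout(ctx, postrunWaitTimeout)
	defer cancel()
	return mgr.WaitForPlugin(ctx, "csi-node", pluginID)
}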

@gulducat gulducat force-pushed the fix-13028-csi-volumes-on-client-start branch from d2dcc14 to 1813c9b on April 6, 2023 18:06

@gulducat (Member Author) commented Apr 6, 2023

Thanks! I made changes for everything except for the timeout durations, but happy to change that logic too as you may advise. Let me know what you think!

@tgross (Member) left a comment

LGTM, once you're happy that this solves the problem for non-hostpath plugins too. Don't forget to add a changelog and backport labels.

@gulducat gulducat added the backport/1.3.x, backport/1.4.x, and backport/1.5.x labels on Apr 10, 2023
@gulducat gulducat changed the title from "new WaitForPlugin() during csiHook.Prerun" to "gracefully recover tasks that use csi node plugins" on Apr 10, 2023
@gulducat gulducat merged commit dbe8365 into main Apr 10, 2023
@gulducat gulducat deleted the fix-13028-csi-volumes-on-client-start branch April 10, 2023 22:15
@gulducat gulducat added and removed the backport/1.4.x and backport/1.5.x labels on Apr 11, 2023

Labels: backport/1.3.x, backport/1.4.x, backport/1.5.x
Development

Successfully merging this pull request may close these issues: Allocations with CSI volume fail during Nomad agent restart (#13028).