
gracefully recover tasks that use csi node plugins #16809

Merged: gulducat merged 4 commits into main from fix-13028-csi-volumes-on-client-start on Apr 10, 2023

Conversation

@gulducat (Member) commented Apr 5, 2023

When a client restarts, happily running tasks that use a CSI volume may fail to be recovered and instead be killed and rescheduled, even though a healthy CSI plugin task is running right next to them.

The simplest reproduction of the error condition I could manage uses this repo's demo/csi/hostpath example. With the tasks all set up, I stop the client and start it again:

2023-04-05T22:10:33.940Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=0480ac51-dd38-49ec-8831-48fd3c674279 task=redis type=Received msg="Task received by client" failed=false
2023-04-05T22:10:33.948Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=d013ee08-10fc-ff0e-ef5e-cc444048fbfd task=plugin type=Received msg="Task received by client" failed=false
2023-04-05T22:10:33.959Z [ERROR] client.alloc_runner: prerun failed: alloc_id=0480ac51-dd38-49ec-8831-48fd3c674279 error="pre-run hook \"csi_hook\" failed: plugin hostpath-plugin0 for type csi-node not found"
2023-04-05T22:10:33.964Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=0480ac51-dd38-49ec-8831-48fd3c674279 task=redis type="Setup Failure" msg="failed to setup alloc: pre-run hook \"csi_hook\" failed: plugin hostpath-plugin0 for type csi-node not found" failed=true
2023-04-05T22:10:33.974Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=d013ee08-10fc-ff0e-ef5e-cc444048fbfd task=plugin type="Plugin became healthy" msg="plugin: hostpath-plugin0" failed=false
2023-04-05T22:10:36.973Z [INFO]  client.gc: marking allocation for GC: alloc_id=0480ac51-dd38-49ec-8831-48fd3c674279

The client doesn't yet know that the CSI plugin task is healthy, because it happens to try to recover the task that needs the plugin first.

After this change, which retries for some time (one minute; is that ok?) to find the plugin that it expects to exist (a rough sketch of the retry idea follows the log output below):

2023-04-05T22:18:03.805Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=3f38e8b5-106b-87c7-9df4-b5815fa955d5 task=redis type=Received msg="Task received by client" failed=false
2023-04-05T22:18:03.819Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=d6f262f3-b20d-89d2-1bdd-6f92a01c2003 task=plugin type=Received msg="Task received by client" failed=false
2023-04-05T22:18:04.116Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=d6f262f3-b20d-89d2-1bdd-6f92a01c2003 task=plugin type="Plugin became healthy" msg="plugin: hostpath-plugin0" failed=false
2023-04-05T22:18:04.262Z [INFO]  client.alloc_runner.runner_hook.csi_hook: found CSI plugin: alloc_id=3f38e8b5-106b-87c7-9df4-b5815fa955d5 type=csi-node name=hostpath-plugin0
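
For illustration only, here is a rough sketch of the retry idea (hypothetical names, not the actual diff): instead of failing the csi_hook prerun immediately when the node plugin isn't registered yet, poll the registry until the plugin shows up or a deadline passes. The lookup function, the one-minute bound, and the poll interval are stand-ins for whatever the registry and csimanager actually expose.

package sketch

import (
	"context"
	"fmt"
	"time"
)

// PluginInfo is a placeholder for whatever the plugin registry returns.
type PluginInfo struct {
	Name string
}

// waitForNodePlugin polls until the named CSI node plugin is registered or
// the deadline expires. lookup stands in for the registry call that, before
// this change, failed immediately while the plugin alloc was still restoring.
func waitForNodePlugin(ctx context.Context, pluginID string,
	lookup func(id string) (*PluginInfo, bool)) (*PluginInfo, error) {

	// bound the wait so a genuinely missing plugin still fails the alloc
	ctx, cancel := context.WithTimeout(ctx, time.Minute)
	defer cancel()

	ticker := time.NewTicker(2 * time.Second)
	defer ticker.Stop()

	for {
		if p, ok := lookup(pluginID); ok {
			return p, nil // plugin task recovered and registered itself
		}
		select {
		case <-ctx.Done():
			return nil, fmt.Errorf("plugin %s for type csi-node not found: %w", pluginID, ctx.Err())
		case <-ticker.C:
			// try again; the plugin alloc may still be restoring
		}
	}
}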

We may wish to test with a more complex CSI setup to make sure that this fully Fixes #13028

so that on startup, clients can recover running tasks that use CSI volumes, instead of them being terminated and rescheduled because they need a volume that "doesn't exist" yet, only because the plugin task has not yet been recovered.
@gulducat gulducat requested a review from tgross April 5, 2023 22:43
@tgross (Member) left a comment

This is looking really good, @gulducat. I've left a few comments around some of the implementation details that we'll want to fix up.

client/dynamicplugins/registry.go (resolved review thread)
client/pluginmanager/csimanager/manager.go (resolved review thread)
client/pluginmanager/csimanager/manager.go (resolved review thread)
client/dynamicplugins/registry.go (resolved review thread)
// not specific RPC timeouts, but we manage the stream
// lifetime via Close in the pluginmanager.
mounter, err := c.csimanager.MounterForPlugin(c.shutdownCtx, pair.volume.PluginID)
// make sure the plugin is ready or becomes so quickly.

tgross (Member)

I think this comment is misleading, because we're actually waiting up to 24hr?

Having the really long timeout on the unpublish workflow makes sense because we need to be able to recover the already-mounted volume from that state. But on the publish workflow we probably want to abandon the attempt within a short-ish amount of time so that we're not delaying a reschedule too much in the case where we've placed a new alloc and it's never going to work.

Maybe... 5min to allow for a very busy restoring client?

(Ideally we'd be able to distinguish between a "this is the first time" vs a "this is a restore", but we can do that and/or make the timeout tunable later.)

gulducat (Member Author)

Up to 24 hours is the last catch-all timeout on the registry end, just in case callers don't set their own timeouts like they should. The csiManager's WaitForPlugin() adds a timeout of one minute because it has (or can have) more specific domain knowledge of its plugins' behavior.

I went back and forth on where to put which contexts with which timeout durations, and these values worked okay in my little sandbox. Should I bump csiManager's up to 5min, or move the timeouts around some other way?
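
To make the layering concrete, here is a minimal sketch of the two nested timeouts being described (illustrative names, not the real API): the registry applies a very long catch-all deadline as a safety net for careless callers, while the csimanager-level wait applies its own one-minute deadline, so the effective bound is whichever is shorter.

package sketch

import (
	"context"
	"time"
)

// registryCatchAll mimics the registry-side safety net: even if a caller
// forgot to set a deadline, the wait is capped at 24 hours.
func registryCatchAll(ctx context.Context) (context.Context, context.CancelFunc) {
	return context.WithTimeout(ctx, 24*time.Hour)
}

// managerWait mimics the csimanager-level wait, which knows more about its
// plugins' behavior and adds its own shorter bound before calling into the
// registry. Nesting WithTimeout can only shrink the deadline, so the
// one-minute bound is the one that takes effect here.
func managerWait(ctx context.Context, waitInRegistry func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, time.Minute)
	defer cancel()

	ctx, cancel = registryCatchAll(ctx)
	defer cancel()

	return waitInRegistry(ctx)
}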

tgross (Member)

Ah, ok, I missed the 1 min on csimanager.WaitForPlugin. If we end up refactoring the postrun hook to use csimanager.WaitForPlugin too, we might find we want to lift that timeout configuration into the caller (the prerun or postrun hook), but this should be fine as-is for now.
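
Sketching the refactor being suggested here (hypothetical names and signature; the real WaitForPlugin may differ): each hook would own its timeout and pass an already-bounded context to a shared wait helper, so the publish path can fail fast while the unpublish path keeps a much longer grace period, without the manager hard-coding either value.

package sketch

import (
	"context"
	"time"
)

// pluginWaiter stands in for csimanager; this WaitForPlugin signature is a
// guess for illustration, not the real interface.
type pluginWaiter interface {
	WaitForPlugin(ctx context.Context, pluginType, pluginID string) error
}

// Hypothetical caller-owned bounds if the timeout moved out of csimanager:
// the prerun (publish) path fails fast so a reschedule isn't delayed, while
// the postrun (unpublish) path waits much longer so an already-mounted
// volume can still be cleaned up.
const (
	prerunWaitTimeout  = 1 * time.Minute
	postrunWaitTimeout = 24 * time.Hour
)

func prerunWait(ctx context.Context, mgr pluginWaiter, pluginID string) error {
	ctx, cancel := context.WithTimeout(ctx, prerunWaitTimeout)
	defer cancel()
	return mgr.WaitForPlugin(ctx, "csi-node", pluginID)
}

func postrunWait(ctx context.Context, mgr pluginWaiter, pluginID string) error {
	ctx, cancel := context.WithTimeout(ctx, postrunWaitTimeout)
	defer cancel()
	return mgr.WaitForPlugin(ctx, "csi-node", pluginID)
}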

@gulducat gulducat force-pushed the fix-13028-csi-volumes-on-client-start branch from d2dcc14 to 1813c9b on April 6, 2023 18:06

@gulducat (Member Author) commented Apr 6, 2023

Thanks! I made changes for everything except for the timeout durations, but happy to change that logic too as you may advise. Let me know what you think!

@tgross (Member) left a comment

LGTM, once you're happy that this solves the problem for non-hostpath plugins too. Don't forget to add a changelog and backport labels.

@gulducat gulducat added the backport/1.3.x, backport/1.4.x, and backport/1.5.x labels on Apr 10, 2023
@gulducat gulducat changed the title from "new WaitForPlugin() during csiHook.Prerun" to "gracefully recover tasks that use csi node plugins" on Apr 10, 2023
@gulducat gulducat merged commit dbe8365 into main Apr 10, 2023
@gulducat gulducat deleted the fix-13028-csi-volumes-on-client-start branch April 10, 2023 22:15
@gulducat gulducat added and removed the backport/1.4.x and backport/1.5.x labels on Apr 11, 2023

Labels: backport/1.3.x, backport/1.4.x, backport/1.5.x
Development

Successfully merging this pull request may close these issues: Allocations with CSI volume fail during Nomad agent restart (#13028).