gracefully recover tasks that use csi node plugins #16809
Conversation
so that on startup, clients can recover running tasks that use CSI volumes, instead of them being terminated and rescheduled because they need a volume that "doesn't exist" yet, only because the plugin task has not yet been recovered.
This is looking really good, @gulducat. I've left a few comments around some of the implementation details that we'll want to fix up.
```go
// not specific RPC timeouts, but we manage the stream
// lifetime via Close in the pluginmanager.
mounter, err := c.csimanager.MounterForPlugin(c.shutdownCtx, pair.volume.PluginID)
// make sure the plugin is ready or becomes so quickly.
```
I think this comment is misleading, because we're actually waiting up to 24hr?
Having the really long timeout on the unpublish workflow makes sense because we need to be able to recover the already-mounted volume from that state. But on the publish workflow we probably want to abandon the attempt within a short-ish amount of time so that we're not delaying a reschedule too much in the case where we've placed a new alloc and it's never going to work.
Maybe... 5min to allow for a very busy restoring client?
(Ideally we'd be able to distinguish between a "this is the first time" vs a "this is a restore", but we can do that and/or make the timeout tunable later.)
Up to 24 hours is the last catch-all timeout on the registry end, just in case callers don't set their own timeouts like they should. The csiManager's WaitForPlugin() adds a timeout of 1 minute because it has (or can have) more specific domain knowledge of its plugins' behavior.
I went back and forth on where to put which contexts with which timeout durations, and these values worked okay in my little sandbox. Should I bump csiManager's up to 5 minutes, or move them around some other way?
Ah, ok, I missed the 1 min on `csimanager.WaitForPlugin`. If we end up refactoring the postrun hook to use `csimanager.WaitForPlugin` too, we might find we want to lift that timeout configuration into the caller (the prerun or postrun hook), but this should be fine as-is for now.
Force-pushed ("and remove debris") from d2dcc14 to 1813c9b.
Thanks! I made changes for everything except the timeout durations, but I'm happy to change that logic too as you advise. Let me know what you think!
LGTM, once you're happy that this solves the problem for non-hostpath plugins too. Don't forget to add a changelog and backport labels.
Given happy running tasks that use a CSI volume, on client restart they may fail to be recovered and instead be killed and rescheduled, even though a healthy, happy CSI plugin is running right next to them.
The simplest version of the error condition I could replicate uses this repo's
demo/csi/hostpath
example. With the tasks all set up, I stop the client and start it again. The client doesn't yet know that the CSI plugin task is healthy, because it happens to try to recover the task that needs it first.
After this change, the client retries for some time (one minute; is that ok?) to find the plugin that it expects to exist.
We may wish to test with a more complex CSI setup to make sure that this fully Fixes #13028.