After a CSI plugin task is started, the `plugin_supervisor_hook` sends the `Probe` RPC to verify the plugin is up and healthy. Once it is, we call the `GetPluginInfo` RPC and then register the plugin, which starts up a CSI manager instance; that instance runs fingerprinting and also serves as the entrypoint for other tasks to request volume mounts (e.g. from `csi_hook`).
There are two ongoing loops:
- The `csimanager` uses various get-capability RPCs to fingerprint the plugin and sets the `structs.CSIInfo` health based on whether it can do so. This matters because we wait until we've fingerprinted the plugin before telling the server it's healthy. It then sets up a loop that re-sends these RPCs every 30s.
- The `plugin_supervisor_hook` uses the probe RPC loop to emit task events. If the probes become unhealthy, it does not deregister the plugin or otherwise trigger anything in the `csimanager`, so this task event really only tells us whether the gRPC client can connect.
As a consequence, we're sending 2-4 RPCs to the plugins every 30s instead of 1. We don't expect the plugin fingerprint to change after we initially receive it (if a plugin does this, it is bad and should feel bad 😀). And the only purpose of the `plugin_supervisor_hook` loop is to trigger task events and to deregister the plugin when the task shuts down (which ultimately shuts down the `csimanager` instance as well).
Proposal:
- The `plugin_supervisor_hook` sends probe RPCs until one succeeds, then registers the plugin.
- The `csimanager` instance sends the fingerprint RPCs until they succeed, which sets the initial fingerprint.
- The `csimanager` instance then "decays" to sending probe RPCs periodically, updating the fingerprint health on that basis.
- Add an interface to `csimanager` to check the health of a plugin (similar to the existing `csimanager.MounterForPlugin()` method), which just grabs the most recent fingerprint data.
- The `plugin_supervisor_hook` polls this interface every 30s and emits a task event on change, so we still get task events as expected.
Note this isn't a correctness issue, but it's inefficient and spams the plugin allocation's logs, which isn't a very nice experience for operators. I ran into this again while debugging #11784 and wanted to make sure I got it all down for future work. We won't block CSI's GA on this one.
Leaving a note here about the slow plugin registration issue that's related to this. Here are some logs from some recent testing I was doing. Note that the plugin is marked healthy in the task runner, and then it's a full 7 seconds before the node fingerprint is written: