CSI: plugin fingerprinting should not be used for repeated health checking #12056

tgross · 2022-02-11T16:16:26Z

After a CSI plugin task is started, the plugin_supervisor_hook sends the Probe RPC to verify the plugin is up and healthy. Once it's up and healthy, we call the GetPluginInfo RPC and then register the plugin which starts up a CSI manager instance which runs fingerprinting and also serves as the entrypoint for other tasks to request volume mounts (ex. from csi_hook).

There are two ongoing loops:

The csimanager uses various get-capability RPCs to fingerprint the plugin and sets the structs.CSIInfo health based on whether it can do so. This is important so that we wait until we've fingerprinted the plugin to tell the server that it's healthy. It then sets up a loop that re-sends these every 30s.
The plugin_supervisor_hook uses the probe RPC loop to emit task events. If these become unhealthy it does not deregister the plugin to trigger anything in the csimanager, so this task event really only tells us whether the gRPC client can connect.

As a consequence we're sending 2-4 RPCs to the plugins every 30s instead of 1. We don't expect the plugin fingerprint to change after we initially receive it (if a plugin does this it is bad and should feel bad 😀 ). And the only purpose of the plugin_supervisor_hook loop is to trigger task events and to deregister the plugin when the task shuts down (which ultimately shuts down the csimanager/instance as well).

Proposal:

The plugin_supervisor_hook sends probe RPCs until it succeeds, then registers the plugin.
The csimanager/instance sends the fingerprint RPCs until it succeeds, which updates the initial fingerprint.
The csimanager/instance then "decays" to sending probe RPCs periodically, updating the fingerprint health on this basis.
Add an interface to csimanager to check the health of a plugin (similar to the existing csimanager.MounterForPlugin() method), which just grabs the most recent fingerprint data.
The plugin_supervisor_hook polls this interface every 30 and emits a task event on change, so we still get task events as expected.

Note this isn't a correctness issue but it's inefficient and does spam the plugin allocation's logs, which isn't a very nice experience for operators. I ran into again this while debugging #11784 and wanted to make sure I got it all down for future work. We won't block CSI's GA for this one.

The text was updated successfully, but these errors were encountered:

tgross · 2023-03-31T18:05:56Z

Leaving a note here about the slow plugin registration issue that's related to this. Here's some logs from some recent testing I was doing. Note that the plugin is marked healthy in the task runner and then it's a full 7 seconds before the node fingerprint is written:

2023-03-31T14:04:11.359-0400 [INFO] client.alloc_runner.task_runner: Task event: alloc_id=c99748e0-a349-730d-6c2d-a7f5f39a065a task=plugin type="Plugin became healthy" msg="plugin: hostpath-plugin0" failed=false
2023-03-31T14:04:18.595-0400 [INFO] client: node registration complete

tgross added type/enhancement theme/storage theme/bad-ux labels Feb 11, 2022

tgross mentioned this issue Feb 24, 2022

clients should not fingerprint CSI node-only plugins with GetNodeInfo #10055

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CSI: plugin fingerprinting should not be used for repeated health checking #12056

CSI: plugin fingerprinting should not be used for repeated health checking #12056

tgross commented Feb 11, 2022 •

edited

Loading

tgross commented Mar 31, 2023 •

edited

Loading

CSI: plugin fingerprinting should not be used for repeated health checking #12056

CSI: plugin fingerprinting should not be used for repeated health checking #12056

Comments

tgross commented Feb 11, 2022 • edited Loading

tgross commented Mar 31, 2023 • edited Loading

tgross commented Feb 11, 2022 •

edited

Loading

tgross commented Mar 31, 2023 •

edited

Loading