Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSI: plugin fingerprinting should not be used for repeated health checking #12056

Open
tgross opened this issue Feb 11, 2022 · 1 comment
Open

Comments

@tgross
Copy link
Member

tgross commented Feb 11, 2022

After a CSI plugin task is started, the plugin_supervisor_hook sends the Probe RPC to verify the plugin is up and healthy. Once it's up and healthy, we call the GetPluginInfo RPC and then register the plugin which starts up a CSI manager instance which runs fingerprinting and also serves as the entrypoint for other tasks to request volume mounts (ex. from csi_hook).

There are two ongoing loops:

  • The csimanager uses various get-capability RPCs to fingerprint the plugin and sets the structs.CSIInfo health based on whether it can do so. This is important so that we wait until we've fingerprinted the plugin to tell the server that it's healthy. It then sets up a loop that re-sends these every 30s.
  • The plugin_supervisor_hook uses the probe RPC loop to emit task events. If these become unhealthy it does not deregister the plugin to trigger anything in the csimanager, so this task event really only tells us whether the gRPC client can connect.

As a consequence we're sending 2-4 RPCs to the plugins every 30s instead of 1. We don't expect the plugin fingerprint to change after we initially receive it (if a plugin does this it is bad and should feel bad 😀 ). And the only purpose of the plugin_supervisor_hook loop is to trigger task events and to deregister the plugin when the task shuts down (which ultimately shuts down the csimanager/instance as well).

Proposal:

  • The plugin_supervisor_hook sends probe RPCs until it succeeds, then registers the plugin.
  • The csimanager/instance sends the fingerprint RPCs until it succeeds, which updates the initial fingerprint.
  • The csimanager/instance then "decays" to sending probe RPCs periodically, updating the fingerprint health on this basis.
  • Add an interface to csimanager to check the health of a plugin (similar to the existing csimanager.MounterForPlugin() method), which just grabs the most recent fingerprint data.
  • The plugin_supervisor_hook polls this interface every 30 and emits a task event on change, so we still get task events as expected.

Note this isn't a correctness issue but it's inefficient and does spam the plugin allocation's logs, which isn't a very nice experience for operators. I ran into again this while debugging #11784 and wanted to make sure I got it all down for future work. We won't block CSI's GA for this one.

@tgross
Copy link
Member Author

tgross commented Mar 31, 2023

Leaving a note here about the slow plugin registration issue that's related to this. Here's some logs from some recent testing I was doing. Note that the plugin is marked healthy in the task runner and then it's a full 7 seconds before the node fingerprint is written:

2023-03-31T14:04:11.359-0400 [INFO] client.alloc_runner.task_runner: Task event: alloc_id=c99748e0-a349-730d-6c2d-a7f5f39a065a task=plugin type="Plugin became healthy" msg="plugin: hostpath-plugin0" failed=false
2023-03-31T14:04:18.595-0400 [INFO] client: node registration complete

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant