CSI: make gRPC client creation more robust #12057

tgross · 2022-02-11T19:24:10Z

Related to #11784. This doesn't fix that issue, but it should make it easier to debug.
I've broken this up into mostly bite-sized commits for review.

Nomad communicates with CSI plugin tasks via gRPC. The plugin
supervisor hook uses this to ping the plugin for health checks which
it emits as task events. After the first successful health check the
plugin supervisor registers the plugin in the client's dynamic plugin
registry, which in turn creates a CSI plugin manager instance that has
its own gRPC client for fingerprinting the plugin and sending mount
requests.

If the plugin manager instance fails to connect to the plugin on its
first attempt, it exits. The plugin supervisor hook is unaware that
the connection failed so long as its own pings continue to work. A
transient failure during plugin startup can therefore leave the plugin
supervisor hook believing the plugin is up (so it sees no need to
restart the allocation) even though no fingerprinter was ever started.

* Refactor the gRPC client to connect on first use. This lets the
  plugin manager instance retry the gRPC client connection until it
  succeeds.
* Add a 30s timeout to the plugin supervisor so that we don't poll
  forever waiting for a plugin that will never come back up.
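
As a rough illustration of the connect-on-first-use idea (not the code in this PR), here is a minimal sketch using only the standard library. The `conn`, `dialFunc`, `lazyClient`, and `probeUntilReady` names are hypothetical stand-ins; a real client would hold a gRPC connection and issue an actual RPC in `Probe`.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// conn is a hypothetical placeholder for a real gRPC connection.
type conn struct{ addr string }

// dialFunc is a hypothetical hook standing in for the real dial call.
type dialFunc func(ctx context.Context, addr string) (*conn, error)

// lazyClient defers dialing until the first call that needs a connection.
type lazyClient struct {
	addr string
	dial dialFunc

	mu   sync.Mutex
	conn *conn
}

// ensureConnected dials once and caches the result. A failed dial leaves
// conn nil, so the next call tries again instead of giving up for good.
func (c *lazyClient) ensureConnected(ctx context.Context) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.conn != nil {
		return nil
	}
	cn, err := c.dial(ctx, c.addr)
	if err != nil {
		return fmt.Errorf("connecting to plugin at %s: %w", c.addr, err)
	}
	c.conn = cn
	return nil
}

// Probe stands in for any RPC: it connects on first use and then would
// issue the request over the cached connection.
func (c *lazyClient) Probe(ctx context.Context) error {
	return c.ensureConnected(ctx)
}

// probeUntilReady retries Probe at the given interval until it succeeds or
// ctx is done, and reports how many attempts were made.
func probeUntilReady(ctx context.Context, c *lazyClient, interval time.Duration) (int, error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for attempts := 1; ; attempts++ {
		err := c.Probe(ctx)
		if err == nil {
			return attempts, nil
		}
		select {
		case <-ctx.Done():
			return attempts, fmt.Errorf("%w (last error: %v)", ctx.Err(), err)
		case <-ticker.C:
		}
	}
}

// demo simulates a plugin whose socket refuses the first two dials.
func demo() (int, error) {
	failures := 2
	c := &lazyClient{
		addr: "/tmp/csi.sock", // hypothetical socket path
		dial: func(ctx context.Context, addr string) (*conn, error) {
			if failures > 0 {
				failures--
				return nil, errors.New("connection refused")
			}
			return &conn{addr: addr}, nil
		},
	}
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	return probeUntilReady(ctx, c, 10*time.Millisecond)
}

func main() {
	attempts, err := demo()
	fmt.Println(attempts, err) // prints "3 <nil>"
}
```

Because a failed dial is not cached, a transient failure at startup costs only a retry rather than ending the caller's loop.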

Minor improvements:

* The plugin supervisor hook creates a new gRPC client for every probe
  and then throws it away. Instead, reuse the client as we do for the
  plugin manager.
* The gRPC client constructor has a 1 second timeout. Clarify that this
  timeout applies to the connection and not the rest of the client
  lifetime.
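
To make those two points concrete, here is a small sketch (again, not the PR's code) in which the connect timeout bounds only the dial and a single client is reused across probes. `pluginClient`, `newClient`, and `supervisor` are hypothetical names, and the dial is simulated rather than a real gRPC call.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// pluginClient is a hypothetical stand-in for a gRPC-backed CSI client.
type pluginClient struct {
	probes int
}

// Probe is bounded by the caller's context, not by the connect timeout.
func (c *pluginClient) Probe(ctx context.Context) error {
	if err := ctx.Err(); err != nil {
		return err
	}
	c.probes++
	return nil
}

// dialFunc is a hypothetical hook standing in for the real dial call.
type dialFunc func(ctx context.Context) (*pluginClient, error)

// newClient applies connectTimeout to the dial only. The derived context is
// cancelled when newClient returns, so it cannot affect later calls made on
// the returned client.
func newClient(dial dialFunc, connectTimeout time.Duration) (*pluginClient, error) {
	ctx, cancel := context.WithTimeout(context.Background(), connectTimeout)
	defer cancel()
	return dial(ctx)
}

// supervisor keeps one client and reuses it for every probe.
type supervisor struct {
	dial   dialFunc
	client *pluginClient
	dials  int
}

func (s *supervisor) probe(ctx context.Context) error {
	if s.client == nil {
		c, err := newClient(s.dial, time.Second)
		if err != nil {
			return err
		}
		s.client = c
		s.dials++
	}
	return s.client.Probe(ctx)
}

// demo probes three times and reports how many dials and probes happened.
func demo() (dials, probes int, err error) {
	s := &supervisor{dial: func(ctx context.Context) (*pluginClient, error) {
		if err := ctx.Err(); err != nil {
			return nil, err
		}
		return &pluginClient{}, nil
	}}
	for i := 0; i < 3; i++ {
		if err := s.probe(context.Background()); err != nil {
			return s.dials, 0, err
		}
	}
	return s.dials, s.client.probes, nil
}

func main() {
	dials, probes, err := demo()
	fmt.Println(dials, probes, err) // prints "1 3 <nil>"
}
```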

shoenig

LGTM; just the one question about kill

github-actions · 2022-10-18T02:46:48Z

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

csi: clarify gRPC connect timeout parameter

787acdf

tgross force-pushed the csi-plugin-client-restarts branch from df76de4 to 0917995 Compare February 11, 2022 19:42

tgross added theme/storage stage/needs-backporting labels Feb 11, 2022

tgross added this to the 1.3.0 milestone Feb 11, 2022

tgross mentioned this pull request Feb 11, 2022

CSI plugin fails to be marked healthy after reboot #11784

Closed

tgross commented Feb 11, 2022

client/allocrunner/taskrunner/plugin_supervisor_hook.go (outdated)

tgross force-pushed the csi-plugin-client-restarts branch from a582d24 to b490349 Compare February 11, 2022 20:58

tgross force-pushed the csi-plugin-client-restarts branch from b490349 to 6b27231 Compare February 11, 2022 22:02

tgross commented Feb 14, 2022

client/pluginmanager/csimanager/instance.go (outdated)

tgross force-pushed the csi-plugin-client-restarts branch from 6b27231 to 7d0f22d Compare February 14, 2022 21:06

tgross force-pushed the csi-plugin-client-restarts branch from 7d0f22d to 4193588 Compare February 14, 2022 21:17

tgross force-pushed the csi-plugin-client-restarts branch from 4193588 to f3d72e6 Compare February 14, 2022 21:18

tgross force-pushed the csi-plugin-client-restarts branch from f3d72e6 to bda257d Compare February 14, 2022 21:23

tgross force-pushed the csi-plugin-client-restarts branch from f8f6440 to 33e03fc Compare February 15, 2022 16:27

tgross force-pushed the csi-plugin-client-restarts branch from 33e03fc to a27bd89 Compare February 15, 2022 16:43

tgross added 2 commits February 15, 2022 13:32

csi: exit plugin supervisor after 30s without initial connection

a42c1cf

The plugin supervisor registers the plugin in the `Poststart` hook, so the task itself should be running. If the plugin can't communicate with us after 30s, exit and mark the task as unhealthy so that it can be restarted.
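
The behaviour this commit message describes could look roughly like the sketch below: poll the plugin until the first probe succeeds, and give up once a deadline passes so the caller can mark the task unhealthy. `waitForInitialProbe` and `errPluginUnreachable` are hypothetical names, and the demo uses a 50ms deadline in place of the 30s from the commit so that it finishes quickly.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errPluginUnreachable is a hypothetical sentinel the caller can use to
// decide that the task should be marked unhealthy.
var errPluginUnreachable = errors.New("plugin did not respond before the deadline")

// waitForInitialProbe calls probe every interval until it succeeds, the
// timeout elapses, or the parent context is cancelled.
func waitForInitialProbe(
	ctx context.Context,
	probe func(context.Context) error,
	timeout, interval time.Duration,
) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		if err := probe(ctx); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("%w: %v", errPluginUnreachable, ctx.Err())
		case <-ticker.C:
		}
	}
}

// demo runs the wait loop against a plugin that answers immediately and
// against one that never answers.
func demo() (healthyErr error, timedOut bool) {
	ok := func(context.Context) error { return nil }
	dead := func(context.Context) error { return errors.New("connection refused") }

	healthyErr = waitForInitialProbe(context.Background(), ok, 50*time.Millisecond, 10*time.Millisecond)
	deadErr := waitForInitialProbe(context.Background(), dead, 50*time.Millisecond, 10*time.Millisecond)
	return healthyErr, errors.Is(deadErr, errPluginUnreachable)
}

func main() {
	healthyErr, timedOut := demo()
	fmt.Println(healthyErr, timedOut) // prints "<nil> true"
}
```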

changelog entry

4804d2f

tgross force-pushed the csi-plugin-client-restarts branch from a27bd89 to 4804d2f Compare February 15, 2022 18:33

tgross marked this pull request as ready for review February 15, 2022 19:05

tgross requested review from shoenig, DerekStrickland and lgfa29 February 15, 2022 19:12

shoenig approved these changes Feb 15, 2022

client/allocrunner/taskrunner/plugin_supervisor_hook.go

address comments from code review

dfadb2b

tgross merged commit b775a73 into main Feb 15, 2022

tgross deleted the csi-plugin-client-restarts branch February 15, 2022 21:57

lgfa29 added backport/1.1.x backport/1.2.x labels Apr 19, 2022

This was referenced Apr 19, 2022

Backport of CSI: make gRPC client creation more robust into release/1.2.x #12635

Merged

Backport of CSI: make gRPC client creation more robust into release/1.1.x #12636

Merged

lgfa29 removed stage/needs-backporting labels Apr 19, 2022

github-actions bot locked as resolved and limited conversation to collaborators Oct 18, 2022
