Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Backport of CSI: make gRPC client creation more robust into release/1.1.x #12636

Conversation

hc-github-team-nomad-core
Copy link
Contributor

Backport

This PR is auto-generated from #12057 to be assessed for backporting due to the inclusion of the label backport/1.1.x.

The below text is copied from the body of the original PR.


Related to #11784. It doesn't fix it but makes debugging it more legible.
I've broken this up into mostly bite-sized commits for review.


Nomad communicates with CSI plugin tasks via gRPC. The plugin
supervisor hook uses this to ping the plugin for health checks which
it emits as task events. After the first successful health check the
plugin supervisor registers the plugin in the client's dynamic plugin
registry, which in turn creates a CSI plugin manager instance that has
its own gRPC client for fingerprinting the plugin and sending mount
requests.

If the plugin manager instance fails to connect to the plugin on its
first attempt, it exits. The plugin supervisor hook is unaware that
connection failed so long as its own pings continue to work. A
transient failure during plugin startup may mislead the plugin
supervisor hook into thinking the plugin is up (so there's no need to
restart the allocation) but no fingerprinter is started.

  • Refactors the gRPC client to connect on first use. This provides the
    plugin manager instance the ability to retry the gRPC client
    connection until success.
  • Add a 30s timeout to the plugin supervisor so that we don't poll
    forever waiting for a plugin that will never come back up.

Minor improvements:

  • The plugin supervisor hook creates a new gRPC client for every probe
    and then throws it away. Instead, reuse the client as we do for the
    plugin manager.
  • The gRPC client constructor has a 1 second timeout. Clarify that this
    timeout applies to the connection and not the rest of the client
    lifetime.

@hc-github-team-nomad-core hc-github-team-nomad-core merged commit 2285548 into release/1.1.x Apr 19, 2022
@hc-github-team-nomad-core hc-github-team-nomad-core deleted the backport/csi-plugin-client-restarts/perfectly-super-garfish branch April 19, 2022 17:40
@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 18, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants