csi: plugin instance manager should retry creating gRPC client
Nomad communicates with CSI plugin tasks via gRPC. The plugin
supervisor hook uses this to ping the plugin for health checks which
it emits as task events. After the first successful health check the
plugin supervisor registers the plugin in the client's dynamic plugin
registry, which in turn creates a CSI plugin manager instance that has
its own gRPC client for fingerprinting the plugin and sending mount
requests.

If the plugin manager instance fails to connect to the plugin on its
first attempt, it exits. The plugin supervisor hook is unaware that
the connection failed so long as its own pings continue to work. A
transient failure during plugin startup may mislead the plugin
supervisor hook into thinking the plugin is up (so there's no need to
restart the allocation) but no fingerprinter is started. Update the
plugin manager instance to retry the gRPC client connection until
success.

Includes two other small improvements:
* The plugin supervisor hook creates a new gRPC client for every probe
  and then throws it away. Instead, reuse the client as we do for the
  plugin manager.
* The gRPC client constructor has a 1 second timeout. Clarify that this
  timeout applies to the connection and not the rest of the client
  lifetime.
tgross committed Feb 11, 2022
1 parent 72e19c3 commit 0917995
Showing 1 changed file with 15 additions and 9 deletions.
diff --git a/client/pluginmanager/csimanager/instance.go b/client/pluginmanager/csimanager/instance.go
--- a/client/pluginmanager/csimanager/instance.go
+++ b/client/pluginmanager/csimanager/instance.go
@@ -10,6 +10,7 @@ import (
 )
 
 const managerFingerprintInterval = 30 * time.Second
+const managerFingerprintRetryInterval = 5 * time.Second
 
 // instanceManager is used to manage the fingerprinting and supervision of a
 // single CSI Plugin.
@@ -73,15 +74,6 @@ func newInstanceManager(logger hclog.Logger, eventer TriggerNodeEvent, updater U
 }
 
 func (i *instanceManager) run() {
-	c, err := csi.NewClient(i.info.ConnectionInfo.SocketPath, i.logger)
-	if err != nil {
-		i.logger.Error("failed to setup instance manager client", "error", err)
-		close(i.shutdownCh)
-		return
-	}
-	i.client = c
-	i.fp.client = c
-
 	go i.setupVolumeManager()
 	go i.runLoop()
 }
@@ -96,6 +88,9 @@ func (i *instanceManager) setupVolumeManager() {
 	case <-i.shutdownCtx.Done():
 		return
 	case <-i.fp.hadFirstSuccessfulFingerprintCh:
+		// the runLoop goroutine populates i.client but we never get
+		// the first fingerprint until after it's been populated, so
+		// this is safe
 		i.volumeManager = newVolumeManager(i.logger, i.eventer, i.client, i.mountPoint, i.containerMountPoint, i.fp.requiresStaging)
 		i.logger.Debug("volume manager setup complete")
 		close(i.volumeManagerSetupCh)
@@ -142,6 +137,17 @@ func (i *instanceManager) runLoop() {
 		return
 
 	case <-timer.C:
+		if i.client == nil {
+			c, err := csi.NewClient(i.info.ConnectionInfo.SocketPath, i.logger)
+			if err != nil {
+				i.logger.Debug("failed to setup instance manager client", "error", err)
+				timer.Reset(managerFingerprintRetryInterval)
+				continue
+			}
+			i.client = c
+			i.fp.client = c
+		}
+
 		ctx, cancelFn := i.requestCtxWithTimeout(managerFingerprintInterval)
 		info := i.fp.fingerprint(ctx)
 		cancelFn()
