client: wait for batched driver updates
Here we retain the 0.8.7 behavior of waiting for driver fingerprints
before registering a node, subject to a timeout.  This is needed for
system jobs: system job scheduling for a node occurs at node
registration, and the race can leave a system job unplaced on the node
because of missing drivers.

The timeout isn't strictly necessary, but this raises it to roughly
1 minute (a 50-second batching timeout plus a 5-second grace period),
which is closer to blocking indefinitely than 1 second is.  The value
needs to be high enough to capture as many drivers/devices as possible,
but low enough that it doesn't risk blocking too long on a misbehaving
plugin.

Fixes #5579
Mahmood Ali authored and preetapan committed Apr 22, 2019
1 parent 58362da commit 6efb949
Showing 2 changed files with 13 additions and 1 deletion.
12 changes: 12 additions & 0 deletions client/client.go
@@ -93,6 +93,11 @@ const (
allocSyncRetryIntv = 5 * time.Second
)

var (
// grace period to allow for batch fingerprint processing
batchFirstFingerprintsProcessingGrace = batchFirstFingerprintsTimeout + 5*time.Second
)

// ClientStatsReporter exposes all the APIs related to resource usage of a Nomad
// Client
type ClientStatsReporter interface {
@@ -416,6 +421,13 @@ func NewClient(cfg *config.Config, consulCatalog consul.CatalogAPI, consulServic
return nil, fmt.Errorf("failed to setup vault client: %v", err)
}

// wait until drivers are healthy before restoring or registering with servers
select {
case <-c.Ready():
case <-time.After(batchFirstFingerprintsProcessingGrace):
logger.Warn("batched fingerprint, registering node with registered so far")
}

// Restore the state
if err := c.restoreState(); err != nil {
logger.Error("failed to restore state", "error", err)
2 changes: 1 addition & 1 deletion client/node_updater.go
@@ -14,7 +14,7 @@ import (
var (
// batchFirstFingerprintsTimeout is the maximum amount of time to wait for
// initial fingerprinting to complete before sending a batched Node update
- batchFirstFingerprintsTimeout = 5 * time.Second
+ batchFirstFingerprintsTimeout = 50 * time.Second
)

// batchFirstFingerprints waits for the first fingerprint response from all