fingerprint: don't clear Consul/Vault attributes on failure #14673

tgross · 2022-09-23T15:26:06Z

Clients periodically fingerprint Vault and Consul to ensure the server has updated attributes in the client's fingerprint. If the client can't reach Vault/Consul, the fingerprinter clears the attributes and requires a node update. Although this seems like correct behavior so that we can detect intentional removal of Vault/Consul access, it has two serious failure modes:

(1) If a local Consul agent is restarted to pick up configuration changes and the client happens to fingerprint at that moment, the client will update its fingerprint, which results in evaluations for all its jobs and all the system jobs in the cluster.

(2) If a client loses Vault connectivity, the same thing happens. But the consequences are much worse in the Vault case because Vault is not run as a local agent, so Vault connectivity failures are highly correlated across the entire cluster. A 15 second Vault outage will cause a new `node-update` evaluation for every system job on the cluster times the number of nodes, plus one `node-update` evaluation for every non-system job on each node. On large clusters with thousands of nodes, we've seen this create a large backlog of no-op evaluations. (See also #14621 for mitigations of this.)
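To make the scale of failure mode (2) concrete, here is a rough back-of-the-envelope sketch in Go. The function name and the example numbers are hypothetical, and it assumes every node runs the same number of non-system jobs and that each node updates its fingerprint exactly once; real clusters will differ.

```go
package main

import "fmt"

// nodeUpdateEvals estimates how many node-update evaluations are created
// when every node in the cluster updates its fingerprint once.
// Hypothetical helper: it assumes a uniform number of non-system jobs per node.
func nodeUpdateEvals(nodes, systemJobs, nonSystemJobsPerNode int) int {
	// One evaluation per system job for each node that changed,
	// plus one evaluation per non-system job running on each node.
	return nodes*systemJobs + nodes*nonSystemJobsPerNode
}

func main() {
	// Illustrative numbers only: 1000 nodes, 10 system jobs,
	// 20 non-system jobs on each node.
	fmt.Println(nodeUpdateEvals(1000, 10, 20)) // prints 30000
}
```

Under these assumed numbers a single correlated Vault blip produces tens of thousands of evaluations, almost all of which are no-ops.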

This changeset updates the fingerprinting behavior to keep the last fingerprint if Consul or Vault queries fail. This prevents a storm of evaluations at the cost of requiring a client restart if Consul or Vault is intentionally removed from the client.
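The idea of keeping the last fingerprint on failure can be sketched roughly as follows. This is a simplified illustration, not the code from this changeset: `stickyFingerprinter`, `queryFunc`, and `errUnreachable` are hypothetical names, and the real fingerprinters in the Nomad client have a different interface.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnreachable is a hypothetical error standing in for a failed
// Consul or Vault query.
var errUnreachable = errors.New("service unreachable")

// queryFunc is a hypothetical stand-in for the call that asks Consul or
// Vault for the attributes to publish in the node fingerprint.
type queryFunc func() (map[string]string, error)

// stickyFingerprinter remembers the last successful result.
type stickyFingerprinter struct {
	query queryFunc
	last  map[string]string
}

// Fingerprint returns the attributes to publish and whether they changed.
// On a query failure it keeps the previous attributes and reports no
// change, so no node update (and no evaluations) would be triggered.
func (f *stickyFingerprinter) Fingerprint() (map[string]string, bool) {
	attrs, err := f.query()
	if err != nil {
		return f.last, false
	}
	changed := !sameAttrs(f.last, attrs)
	f.last = attrs
	return attrs, changed
}

// sameAttrs reports whether two attribute maps hold identical entries.
func sameAttrs(a, b map[string]string) bool {
	if len(a) != len(b) {
		return false
	}
	for k, v := range a {
		if bv, ok := b[k]; !ok || bv != v {
			return false
		}
	}
	return true
}

func main() {
	up := true
	f := &stickyFingerprinter{query: func() (map[string]string, error) {
		if !up {
			return nil, errUnreachable
		}
		// Attribute key used here for illustration only.
		return map[string]string{"vault.accessible": "true"}, nil
	}}

	_, changed := f.Fingerprint()
	fmt.Println("first run changed:", changed)

	up = false // simulate a short outage
	attrs, changed := f.Fingerprint()
	fmt.Println("during outage:", attrs, "changed:", changed)
}
```

The trade-off described above is visible in the sketch: once attributes have been recorded, nothing in this loop ever clears them, so an intentional removal is only picked up when the process restarts with an empty `last`.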

shoenig

LGTM! We might want to leave a note (upgrade guide?) about how to uninstall Consul/Vault, since the attributes will no longer be automatically cleared (just a Client restart?)

Extension of #14673

Once Vault is initially fingerprinted, extend the period, since changes should be infrequent and the fingerprint is relatively expensive because it contacts a central Vault server.

Also move the period timer reset *after* the fingerprint. This is similar to #9435, where the idea is to ensure the retry period starts *after* the operation is attempted. 15s will now be the *minimum* time between fingerprints instead of the *maximum* time between fingerprints.

In the case of Vault fingerprinting, the original behavior might cause the following:

1. Timer is reset to 15s
2. Fingerprint takes 16s
3. Timer has already elapsed, so we immediately fingerprint again

Even if fingerprinting Vault only takes a few seconds, that may very well be due to excessive load, and backing off our fingerprints is desirable. The new behavior ensures we always wait at least 15s between fingerprint attempts and should allow some natural jittering based on server load and network latency.
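The effect of resetting the timer after the operation, rather than before it, can be sketched in Go as below. This is only an illustration of the pattern, not the code from #14693: `runPeriodic` and `minGapHolds` are hypothetical names, and the durations are shortened to milliseconds so the example finishes quickly.

```go
package main

import (
	"fmt"
	"time"
)

// runPeriodic calls fn n times. The timer is re-armed only after fn
// returns, so period is the minimum pause between the end of one call
// and the start of the next, no matter how long fn takes.
// It returns the observed pauses between consecutive calls.
func runPeriodic(period time.Duration, n int, fn func()) []time.Duration {
	gaps := make([]time.Duration, 0, n)
	timer := time.NewTimer(0) // fire immediately for the first run
	defer timer.Stop()

	var lastEnd time.Time
	for i := 0; i < n; i++ {
		<-timer.C
		if !lastEnd.IsZero() {
			gaps = append(gaps, time.Since(lastEnd))
		}
		fn()
		lastEnd = time.Now()
		// Safe to Reset here: the channel was drained by the receive above.
		timer.Reset(period)
	}
	return gaps
}

// minGapHolds runs an operation that is slower than the period and checks
// that every pause between runs was still at least one full period.
func minGapHolds() bool {
	period := 20 * time.Millisecond
	slow := func() { time.Sleep(30 * time.Millisecond) }
	gaps := runPeriodic(period, 3, slow)
	if len(gaps) != 2 {
		return false
	}
	for _, gap := range gaps {
		if gap < period {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println("every gap >= period:", minGapHolds())
}
```

If the `timer.Reset` call were moved above `fn()`, a 30ms operation with a 20ms period would find the timer already expired when it returned, which is the back-to-back behavior described in the numbered steps.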

github-actions · 2023-01-23T02:15:02Z

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

tgross added this to the 1.4.x milestone Sep 23, 2022

tgross added theme/client theme/vault theme/consul labels Sep 23, 2022

tgross force-pushed the f-vault-consul-fingerprint-fails branch from af5d5fe to 4b3db57 Compare September 23, 2022 15:31

tgross requested review from schmichael, shoenig and lgfa29 September 23, 2022 15:38

tgross marked this pull request as ready for review September 23, 2022 15:39

shoenig approved these changes Sep 23, 2022

upgrade guide note

0d99731

tgross modified the milestones: 1.4.x, 1.4.0 Sep 23, 2022

tgross merged commit 786dc5f into main Sep 23, 2022

tgross deleted the f-vault-consul-fingerprint-fails branch September 23, 2022 18:45

schmichael mentioned this pull request Sep 26, 2022

fingerprint: lengthen Vault check after seen #14693

Merged

github-actions bot locked as resolved and limited conversation to collaborators Jan 23, 2023
