fingerprint: don't clear Consul/Vault attributes on failure #14673

Merged: 2 commits, Sep 23, 2022

Commits on Sep 23, 2022

  1. fingerprint: don't clear Consul/Vault attributes on failure

    Clients periodically fingerprint Vault and Consul to ensure the server has
    updated attributes in the client's fingerprint. If the client can't reach
    Vault/Consul, the fingerprinter clears the attributes and requires a node
    update. Although this seems like correct behavior so that we can detect
    intentional removal of Vault/Consul access, it has two serious failure modes:
    
    (1) If a local Consul agent is restarted to pick up configuration changes and the
    client happens to fingerprint at that moment, the client will update its
    fingerprint, resulting in evaluations for all its jobs and all the system jobs
    in the cluster.
    
    (2) If a client loses Vault connectivity, the same thing happens. But the
    consequences are much worse in the Vault case because Vault is not run as a
    local agent, so Vault connectivity failures are highly correlated across the
    entire cluster. A 15 second Vault outage will cause a new `node-update`
    evaluation for every system job on the cluster times the number of nodes, plus
    one `node-update` evaluation for every non-system job on each node. On large
    clusters of 1000s of nodes, we've seen this create a large backlog of evaluations.
    
    This changeset updates the fingerprinting behavior to keep the last fingerprint
    if Consul or Vault queries fail. This prevents a storm of evaluations at the
    cost of requiring a client restart if Consul or Vault is intentionally removed
    from the client.
    tgross committed Sep 23, 2022
    4b3db57
  2. upgrade guide note

    tgross committed Sep 23, 2022
    0d99731