Backport of keyring: fixes for keyring replication on cluster join into release/1.4.x #15010

hc-github-team-nomad-core
Contributor

Backport

This PR is auto-generated from #14987 to be assessed for backporting due to the inclusion of the label backport/1.4.x.

The below text is copied from the body of the original PR.


Fixes #14981. Note to reviewers: these are definitely bugs, but I'm not 100% confident we've solved the problem, because a log line I would expect to see is missing from the reports from both the issue author and our internal user. I'm still trying to reproduce the issue exactly, but it's worth getting some early eyes on this PR anyway.

  • Don't unblock early if rate limit burst exceeded. The rate limiter returns an error and unblocks early if its burst limit is exceeded (unless the burst limit is Inf). Ensure we're not unblocking early, otherwise we'll only slow down the cases where we're already pausing to make external RPC requests.

  • Set MinQueryIndex on stale queries. When keyring replication makes a stale query to non-leader peers to find a key the leader doesn't have, we need to make sure the peer we're querying has had a chance to catch up to the most current index for that key. Otherwise it's possible for newly-added servers to query another newly-added server and get a non-error nil response for that key ID.

  • Note that the "not found" case does not return an error, just an empty key. Update the handling of empty responses so that we don't break the loop early if we hit a server that doesn't have the key. (Peers aren't shuffled so we'd expect to hit the same server repeatedly.)

  • Move the keyring initialize step to wait until we're sure the FSM is current.

  • If a key is rotated immediately following a leader election, plans that are in-flight may get signed before the new leader has the key. Allow for a short timeout-and-retry to avoid rejecting plans.

hc-github-team-nomad-core force-pushed the backport/b-keyring-replication-limit/specially-easy-antelope branch from b6846af to 4e85ef5 on October 21, 2022 18:28
hc-github-team-nomad-core merged commit cbea5f4 into release/1.4.x on Oct 21, 2022
hc-github-team-nomad-core deleted the backport/b-keyring-replication-limit/specially-easy-antelope branch on October 21, 2022 18:28
@github-actions

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators on Feb 19, 2023