Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

keyring: filter by region before checking version #14901

Merged
merged 2 commits into from
Oct 17, 2022

Conversation

tgross
Copy link
Member

@tgross tgross commented Oct 17, 2022

In #14821 we fixed a panic that can happen if a leadership election happens in the middle of an upgrade. That fix checks that all servers are at the minimum version before initializing the keyring (which blocks evaluation processing during trhe upgrade). But the check we implemented is over the serf membership, which includes servers in any federated regions, which don't necessarily have the same upgrade cycle.

Filter the version check by the leader's region.

Fixes #14896 but this fix makes me wonder if we have other lurking bugs in this version check.

In #14821 we fixed a panic that can happen if a leadership election happens in
the middle of an upgrade. That fix checks that all servers are at the minimum
version before initializing the keyring (which blocks evaluation processing
during trhe upgrade). But the check we implemented is over the serf membership,
which includes servers in any federated regions, which don't necessarily have
the same upgrade cycle.

Filter the version check by the leader's region.
@tgross
Copy link
Member Author

tgross commented Oct 17, 2022

(An alternate approach here could be to get the localPeers as done in the keyring replication:

nomad/nomad/encrypter.go

Lines 514 to 523 in 9cdbbbb

// TODO: move this method into Server?
func (krr *KeyringReplicator) getAllPeers() []*serverParts {
krr.srv.peerLock.RLock()
defer krr.srv.peerLock.RUnlock()
peers := make([]*serverParts, 0, len(krr.srv.localPeers))
for _, peer := range krr.srv.localPeers {
peers = append(peers, peer.Copy())
}
return peers
}
)

@tgross
Copy link
Member Author

tgross commented Oct 17, 2022

Tests are green now, but I got this failure twice... it looks like a flake and but it also looks like it's potentially relevant, but I can't figure out where it's coming from yet.

=== RUN   TestServer_ReconcileMember
    leader_test.go:1463: found leader: f00dc7a1-7097-8bbe-2659-ecdcd4ff3622 127.0.0.1:9391
    leader_test.go:1472: found leader: f00dc7a1-7097-8bbe-2659-ecdcd4ff3622 127.0.0.1:9391
Error: 3 2022-10-17T14:37:55.265Z [ERROR] raft-net: failed to flush response: error="write tcp 127.0.0.1:9387->127.0.0.1:35878: write: broken pipe"
    leader_test.go:1496: got 2 server ids want 3: []raft.Server{raft.Server{Suffrage:0, ID:"f00dc7a1-7097-8bbe-2659-ecdcd4ff3622", Address:"127.0.0.1:9391"}, raft.Server{Suffrage:0, ID:"cefda963-7d8f-9a8c-131d-bec0073fe8c2", Address:"127.0.0.1:9383"}}
--- FAIL: TestServer_ReconcileMember (0.19s)

Edit: nevermind, this is in CircleCI's flaky test set

Copy link
Member

@schmichael schmichael left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed offline we need to check our other uses of ServersMeetMinimumVersion and ensure if they're local-only they do this region check, but if they're global (eg ACL related objects) they need to check all members.

Great find on this, I can't believe it hasn't bitten us before. Although I wonder if it has in such a subtle way that we didn't realize it. In other spots this would more or less silently prevent using a new code path.

@tgross
Copy link
Member Author

tgross commented Oct 17, 2022

I'm going to merge this one, and then take a pass thru the existing uses of ServersMeetMinimumVersion and will push up a separate PR for that (because that'll likely need different backporting anyways). We'll also want to pick up #14908 for 1.4.2

@tgross tgross merged commit f6838f6 into main Oct 17, 2022
@tgross tgross deleted the b-region-filter-server-version-check branch October 17, 2022 17:21
@tgross tgross modified the milestones: 1.4.x, 1.4.2 Oct 17, 2022
@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 15, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
backport/1.4.x backport to 1.4.x release line theme/keyring type/bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Not upgrading servers across all regions to 1.4 causes "keyring has not been initialized yet"
2 participants