fix panic from keyring raft entries being written during upgrade #14821

Merged
merged 8 commits into from
Oct 6, 2022


@tgross tgross commented Oct 6, 2022

Fixes #14819

During an upgrade to Nomad 1.4.0, if a server running 1.4.0 becomes the leader before one of the 1.3.x servers, the old server will crash because the keyring is initialized and writes a raft entry.

Wait until all members are on a version that supports the keyring before initializing it.
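
For readers who want to see the shape of the gate described above, here is a minimal sketch and not the code merged in this PR. The `member` type, `parseVersion`, `atLeast`, and `allServersMeetMinimum` are hypothetical names made up for illustration, and treating a `-dev` build as equal to its release version is an assumption of this sketch, not a statement about how Nomad compares versions.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// member is a hypothetical stand-in for a cluster member record.
// Nomad's real types differ.
type member struct {
	Name   string
	Region string
	Build  string // e.g. "1.3.5" or "1.4.0-dev"
}

// parseVersion extracts major, minor, and patch from a build string.
// Assumption: any pre-release suffix such as "-dev" is ignored.
func parseVersion(build string) ([3]int, error) {
	var v [3]int
	core := strings.SplitN(build, "-", 2)[0]
	parts := strings.Split(core, ".")
	if len(parts) != 3 {
		return v, fmt.Errorf("malformed version %q", build)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return v, fmt.Errorf("malformed version %q: %w", build, err)
		}
		v[i] = n
	}
	return v, nil
}

// atLeast reports whether v >= min, comparing major, minor, patch in order.
func atLeast(v, min [3]int) bool {
	for i := range v {
		if v[i] != min[i] {
			return v[i] > min[i]
		}
	}
	return true
}

// allServersMeetMinimum returns true only if every member parses cleanly
// and is at or above min. An unparseable version counts as "not ready".
func allServersMeetMinimum(members []member, min [3]int) bool {
	for _, m := range members {
		v, err := parseVersion(m.Build)
		if err != nil || !atLeast(v, min) {
			return false
		}
	}
	return true
}

func main() {
	members := []member{
		{Name: "server1.global", Region: "global", Build: "1.4.0-dev"},
		{Name: "server2.global", Region: "global", Build: "1.3.5"},
		{Name: "server3.global", Region: "global", Build: "1.4.0-dev"},
	}
	// A leader would skip keyring initialization while this is false.
	fmt.Println(allServersMeetMinimum(members, [3]int{1, 4, 0}))
}
```

In this sketch the leader would call the check before initializing the keyring and retry later if it fails, so that no keyring raft entry is written while an older server could still receive it.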


@jrasell jrasell left a comment

LGTM.

I tested the patch locally against a previously failing scenario; the non-1.4.0 servers no longer panic when a 1.4.0 server becomes leader.

$ nomad server members
Name            Address       Port  Status  Leader  Raft Version  Build      Datacenter  Region
server1.global  192.168.0.14  4648  alive   true    3             1.4.0-dev  dc1         global
server2.global  192.168.0.14  5648  alive   false   3             1.3.5      dc1         global
server3.global  192.168.0.14  6648  alive   false   3             1.4.0-dev  dc1         global

@tgross tgross merged commit 6e108d3 into main Oct 6, 2022
@tgross tgross deleted the b-upgrade-panic-keyring branch October 6, 2022 16:47
tgross added a commit that referenced this pull request Oct 17, 2022
In #14821 we fixed a panic that can happen if a leadership election happens in
the middle of an upgrade. That fix checks that all servers are at the minimum
version before initializing the keyring (which blocks evaluation processing
during the upgrade). But the check we implemented is over the serf membership,
which includes servers in any federated regions, which don't necessarily have
the same upgrade cycle.

Filter the version check by the leader's region.
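
A rough sketch of the region filter this commit message describes, not the actual commit. The `member` type, its `SupportsKeyring` field (standing in for the result of the per-server version check), and `regionReady` are all hypothetical names used only for illustration.

```go
package main

import "fmt"

// member is a hypothetical stand-in for a serf member. SupportsKeyring
// represents the outcome of the minimum-version check for that server.
type member struct {
	Name            string
	Region          string
	SupportsKeyring bool
}

// regionReady reports whether every server in the leader's own region
// supports the keyring. Servers in other (federated) regions are skipped,
// since they may be on a different upgrade cycle.
func regionReady(members []member, leaderRegion string) bool {
	for _, m := range members {
		if m.Region != leaderRegion {
			continue // federated region: not part of this check
		}
		if !m.SupportsKeyring {
			return false
		}
	}
	return true
}

func main() {
	members := []member{
		{Name: "server1.global", Region: "global", SupportsKeyring: true},
		{Name: "server2.global", Region: "global", SupportsKeyring: true},
		{Name: "server1.west", Region: "west", SupportsKeyring: false},
	}
	// The older server in "west" does not block the "global" leader.
	fmt.Println(regionReady(members, "global"))
}
```

Without the filter, a single un-upgraded server in a federated region would hold up keyring initialization in a region that has already finished its own upgrade.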
tgross added a commit that referenced this pull request Oct 17, 2022
In #14821 we fixed a panic that can happen if a leadership election happens in
the middle of an upgrade. That fix checks that all servers are at the minimum
version before initializing the keyring (which blocks evaluation processing
during the upgrade). But the check we implemented is over the serf membership,
which includes servers in any federated regions, which don't necessarily have
the same upgrade cycle.

Filter the version check by the leader's region.

Also bump up log levels of major keyring operations
github-actions bot commented Feb 4, 2023

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 4, 2023
Labels
backport/1.4.x (backport to 1.4.x release line), theme/crash, theme/variables (Variables feature), type/bug
Development

Successfully merging this pull request may close these issues.

Nomad server 1.3.5 isn't able to join 1.4.0
3 participants