
core: store and check for Raft version changes #12362

Merged
@lgfa29 merged 3 commits into main from f-store-raft-version on Mar 24, 2022

Conversation

@lgfa29 (Contributor) commented Mar 23, 2022:

Downgrading the Raft version protocol is not a supported operation.
Checking for a downgrade is hard since this information is not stored in
any persistent place. When a server re-joins a cluster with a prior Raft
version, the Serf tag is updated so Nomad can't tell that the version
changed.

Mixed version clusters must be supported to allow for zero-downtime
rolling upgrades. During an upgrade it's expected that the cluster will
have mixed Raft versions, so enforcing strong version consistency would
disrupt this flow.

The approach taken here is to store the Raft version on disk. When the
server starts, the `raft_protocol` value is written to the file
`data_dir/raft/version`. If that file already exists, its content is
checked against the current `raft_protocol` value to detect downgrades
and prevent the server from starting.

Any other type of error is ignored to prevent disruptions that are
outside the control of operators. The only option in the case of an
invalid or corrupt file would be to delete it, making this check
useless, so we just overwrite its content with the new version and
provide guidance on how operators can check that their cluster is in
the expected state.

Closes #11867
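
For illustration, here is a minimal sketch of the check described above, assuming the version is stored as a plain integer in `data_dir/raft/version`. The function name, error message, and file permissions are assumptions for this example and may not match the merged implementation:

```go
package raftutil

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// checkRaftVersion writes the current Raft protocol version to
// <dataDir>/raft/version and returns an error only when a previously
// stored version is higher than the current one (a downgrade). All
// other errors are ignored, matching the behavior described above.
func checkRaftVersion(dataDir string, current int) error {
	path := filepath.Join(dataDir, "raft", "version")

	if raw, err := os.ReadFile(path); err == nil {
		previous, perr := strconv.Atoi(strings.TrimSpace(string(raw)))
		if perr == nil && previous > current {
			return fmt.Errorf("raft protocol version %d is lower than previously stored version %d: downgrades are not supported", current, previous)
		}
		// An unreadable or corrupt file is ignored and overwritten below.
	}

	// Best effort: never block startup on unrelated filesystem errors.
	if err := os.MkdirAll(filepath.Dir(path), 0o700); err == nil {
		_ = os.WriteFile(path, []byte(strconv.Itoa(current)), 0o600)
	}
	return nil
}
```

The key design choice here is that only a detected downgrade blocks startup; read, parse, or write failures fall through to overwriting the file, so operators are never locked out by a corrupt version file.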

nomad/testing.go Outdated
return s, c
}

func TestServerWithErr(t *testing.T, cb func(*Config)) (*Server, func(), error) {
@lgfa29 (Contributor, Author) commented:

I'm not sure if this is the best approach. I need to test that the server doesn't start, but it was always failing the test due to the t.Fatalf.

I found this other approach, but it doesn't sound right for this case.
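
To illustrate the idea being discussed, here is a hedged sketch of a test-server constructor that returns the setup error instead of calling `t.Fatalf`, so a test can assert that the server fails to start. The `Config`, `Server`, and `newServer` identifiers below are placeholders standing in for the real Nomad types, not the actual implementation:

```go
package testutil

import "testing"

// Placeholder types standing in for Nomad's Config and Server; the real
// types live in the nomad package.
type Config struct{}
type Server struct{}

// newServer is a stand-in for the real server constructor, which may fail,
// for example when a Raft protocol downgrade is detected.
func newServer(c *Config) (*Server, error) { return &Server{}, nil }

// TestServer keeps the existing behavior: any setup error fails the test.
func TestServer(t *testing.T, cb func(*Config)) (*Server, func()) {
	s, cleanup, err := TestServerWithErr(t, cb)
	if err != nil {
		t.Fatalf("failed to start test server: %v", err)
	}
	return s, cleanup
}

// TestServerWithErr returns the error to the caller so tests that expect
// startup to fail can assert on it instead of aborting via t.Fatalf.
func TestServerWithErr(t *testing.T, cb func(*Config)) (*Server, func(), error) {
	config := &Config{}
	if cb != nil {
		cb(config)
	}

	s, err := newServer(config)
	if err != nil {
		return nil, func() {}, err
	}

	cleanup := func() {
		// Shut down the server here in the real implementation.
		_ = s
	}
	return s, cleanup, nil
}
```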

A maintainer (Member) replied:

Seems reasonable

@tgross (Member) left a comment:

LGTM overall, my comments are mostly nitpicking over error messages 😀

Review comments on nomad/server.go (outdated, resolved)

Review comment on nomad/server.go (resolved)
@shoenig (Member) left a comment:

LGTM; just nitpicks

Review comments on nomad/testing.go, nomad/server_test.go, and nomad/server.go (outdated, resolved)
@lgfa29 merged commit 0783ac6 into main on Mar 24, 2022
@lgfa29 deleted the f-store-raft-version branch on March 24, 2022 at 18:42
@github-actions bot commented:
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

The github-actions bot locked this pull request as resolved and limited conversation to collaborators on Oct 24, 2022.
Development

Successfully merging this pull request may close these issues.

Server with raft protocol 2 joining cluster with raft protocol 3
3 participants