Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cluster: Parse unknown self-test checks #22831

Merged

Conversation

WillemKauf
Copy link
Contributor

@WillemKauf WillemKauf commented Aug 10, 2024

To allow for compatibility between clusters of mixed version nodes (in the case of a rolling upgrade), we add the parse_unknown_checks() function to the self_test_backend.

There are two possibilities in a mixed version cluster:

  1. The rpk cluster self-test start request is sent to the controlling node, which is a higher version than the other nodes in the cluster. They do not recognize some of the tests in the start_test_request json, and push back an error to self_test_result. This case is covered by a previous PR, admin: remove throw on unknown self test type in admin_server #21370.
  2. The rpk cluster self-test start request is sent to the controlling node, which is a lower version than the other nodes in the cluster. Previously, this request would not be propagated to the other nodes, since it would not be recognized on the leader node. This case is covered in the current PR, in which we attempt to parse the unknown test with parse_unknown_checks() after serializing and sending the original JSON for the test to the followers.

This currently only affects the cloudcheck, since it was added between versions v24.1.x and v24.2.x. However, should more self tests be added in the future, they can be made version-compatible in the exact same way using parse_unknown_checks().

Also, fixes some serde::version<> bugs we had previously in various self-test related objects.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.2.x
  • v24.1.x
  • v23.3.x

Release Notes

Improvements

  • Allows the self-test to be completely compatible with a mixed version cluster, in the case of a rolling upgrade.

Only applied to `[v24.2.x]` and `[dev]` branches. See
redpanda-data#22778 for change
to `[v24.1.x]`.

Also bumps `serde::version` of `start_test_request` to 2. This
`serde::version<>` needed bumping to `serde::version<2>`, since
`serde::version<0>` is the version of `start_test_request` with just
`diskcheck_opts` and `netcheck_opts`, while `serde::version<1>` is the
version with `unknown_checks` added, and `serde::version<2>` needs to
reflect the most recently added `cloudcheck_opts`.

Also, re-order fields in `start_test_request` to allow serde to work
correctly in the interim.
In the case that the controller node which originally parsed the
`rpk cluster self-test start` command was of a lower version than
the follower nodes in a cluster, it is possible that some tests
may not have been recognized by the controller, and instead pushed
back into `start_test_request.unknown_checks`. On these upgraded
follower nodes, they will attempt to parse any `unknown_checks`, in
case they recognize them.

This is currently only affecting the `cloudcheck` self-test, in mixed
clusters with versions before and after `v24.2.x`. However, any future
self-tests that get added to `redpanda` should also add their case to
`parse_unknown_checks()` for completion's sake.
This struct was erroneously altered by adding a new field without bumping
the `serde::version<>`. This struct also lacks a `serde_fields()` implementation,
so adding the new field anywhere but the end of the structure is an issue
for serialization/deserialization.

Update it by moving the field to the end of the struct, and bumping the
`serde::version<>` to 1.
NFCs in this commit.
@WillemKauf WillemKauf force-pushed the unknown_self_test_check_parse branch from 34a7e98 to a33abd7 Compare August 10, 2024 01:16
@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Aug 10, 2024

Clean up the `self_test_backend` and `admin/server` parse sites
for self-test JSON parsing by adding `parse_self_test_checks()`.

This will now be the function called from `server::self_test_start_handler()`
as well as `self_test_backend::do_start_test()`.

Also, rename `unknown_check` to `unparsed_check` for clarity. These
unparsed checks are still considered "unknown" if they exist by the
end of the self-test.
@WillemKauf WillemKauf requested a review from andrwng August 16, 2024 00:19
@WillemKauf WillemKauf merged commit cf7db12 into redpanda-data:dev Aug 16, 2024
18 checks passed
@vbotbuildovich
Copy link
Collaborator

/backport v24.2.x

Comment on lines 344 to 356
serde::version<1>,
serde::version<2>,
serde::compat_version<0>> {
using rpc_adl_exempt = std::true_type;

uuid_t id;
std::vector<diskcheck_opts> dtos;
std::vector<netcheck_opts> ntos;
std::vector<cloudcheck_opts> ctos;
std::vector<unknown_check> unknown_checks;
std::vector<cloudcheck_opts> ctos;

auto serde_fields() {
return std::tie(id, dtos, ntos, unknown_checks, ctos);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WillemKauf i'm confused about what is going on here, and i can't figure out the answer from the commit message.

looking at the git history i see when version was 0, the order of serde fields in the struct was d,n,c. then it looks like it was bumped to 1 when the unknown checks were appended, which seems correct.

but then here, the commit appears to claim that a bug is being fixed (i'm not sure what the bug was), and also, the ordering of the fields was changed and that's not allowed even when bumping the version.

Comment on lines -381 to +388
serde::version<0>,
serde::version<1>,
serde::compat_version<0>> {
using rpc_adl_exempt = std::true_type;

uuid_t id{};
self_test_status status{};
self_test_stage stage{};
std::vector<self_test_result> results;
self_test_stage stage{};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this one seems more problematic because the ordering prior to this commit was both wrong, but also probably shipped in a release?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants