-
Notifications
You must be signed in to change notification settings - Fork 593
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ducktape test to execute workloads through multi-release upgrades #8253
Conversation
57e5257
to
4f2d12a
Compare
e6c2394
to
eb08344
Compare
eb08344
to
a55ba9c
Compare
e75422a
to
c086131
Compare
c086131
to
3178ba8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a couple thoughts on things to consider w.r.t versioning, nothing really blocking. I think I need to digest the workload protocol and adapters a bit more and let them sit, but at first glance this looks great!
3178ba8
to
877e100
Compare
877e100
to
d6acf1c
Compare
a662522
to
1e112fe
Compare
1e112fe
to
d1c1ead
Compare
d1c1ead
to
a246d09
Compare
https://buildkite.com/redpanda/redpanda/builds/32089#0188fed7-8481-4e6d-92ae-d25de019d22f and a failure related to the pr in debug mode:
edit: actually the timeout is before the check in the test itself, it's related to kgo consumer |
/ci-repeat 3 debug skip-debug dt-repeat=3 tests/rptest/tests/workload_upgrade_runner_test.py::RedpandaUpgradeTest.test_workloads_through_releases |
a246d09
to
1c0352a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test code looks fine to me. Does the new test pass reliably-ish?
timeout_sec=30, | ||
backoff_sec=1) | ||
|
||
old_version = current_version |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: is this assignment needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
old_version is passed on line 258 to mid_upgrade_check() , for the nodes that are not yet updated
head_line = self.head_version()[0:2] | ||
oldest_supported_line = (head_line[0] - 1, head_line[1]) | ||
latest_unsupported_line = (oldest_supported_line[0], | ||
oldest_supported_line[1] - 1) | ||
if latest_unsupported_line[1] == 0: | ||
# if going back, version vX.0 is v(X-1).3 | ||
latest_unsupported_line = (latest_unsupported_line[0] - 1, 3) | ||
return latest_unsupported_line |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: it would be nice to wrap this version back and forth in a type (i.e. RedpandaVersionTriple becomes a dataclass and this stuff is wrapped in a member)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pain with a dataclass is that we use "head" as a version in the codebase, and sometimes we need to convert it to and actual triple. That requires RedpandaInstaller to do it, and a query to a running instance of redpanda in principle, breaking a bit the encapsulation of a dataclass.
# old release have a bug where cloud_storage_enabled is sticky. forcing a leadership transfer is a workaround for this | ||
#Admin(self.ctx.redpanda).partition_transfer_leadership("kafka", self.topic.name, 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember discussing this. Did this solution work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Forgot to remove the comment, but it didn't appear to work reliably. The solution used here is to just start with v22.3 for cloud storage
return PWorkload.DONE | ||
|
||
@abstractmethod | ||
def progress(self, version: RedpandaVersionTriple) -> int: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I'd rename this to something like on_cluster_upgraded
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
✔️
""" | ||
return | ||
|
||
def partial_progress(self, versions: dict[Any, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I'd rename this to on_partial_cluster_upgrade
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
✔️
def partial_progress(self, versions: dict[Any, | ||
RedpandaVersionTriple]) -> int: | ||
""" | ||
This method is called while upgrading a cluster, after each node is upgraded. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Is this comment correct? Looks like this is called after half the cluster was upgraded
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, comment out of sync
and some other python related typing
this callback is used to run checks mid-upgrade
This Protocol defines the interface that a workload has to implement to run inside the workload upgrade runner, introduced in the next commit. Only the methods marked as abstract needs to be implemented, the other are optional. additionally, tests/rptest/tests/workload_dummy.py provides an example implementation
faadc8d
to
be61a1f
Compare
force push: fixed a missing function rename in the code |
e7d7849
to
1c590e8
Compare
/ci-repeat 1 debug tests/rptest/tests/workload_producer_consumer.py |
the tests setup a producer and consumer, ensures that data gets written in cloud storage, and checks the content of the partition manifest to ensure progress
this test is a runner for a collection PWorkload. it will create an upgrade path, insert patch downgrades, setup a cluster and run the workloads concurrently against the cluster. at the end of the test it will report failed workloads. WorkloadAdapter is a wrapper to keep track workload state and to store any thrown exception
the method is used in a loop by RedpandaUpgradeTest and it's a bit noisy
/ci-repeat 1 debug tests/rptest/tests/workload_upgrade_runner_test.py |
1c590e8
to
c8d2a70
Compare
force push: comment fix and new commit to skip-debug since |
https://buildkite.com/redpanda/redpanda/builds/32226#0189080c-5ffe-40e2-b5be-257439b688c7 and one for this test, in debug mode. test is marked as @skip_debug_mode, so there is some weird interaction
edit: last issue was solved by moving @skip_debug_build to the top of the stack of decorators |
transition v23.1 -> v23.2 seems to be flaky in debug mode and a producer/consumer workload running
c8d2a70
to
0913b54
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
A framework to execute tests on a sequence of redpanda versions, composed of 2 components:
A Workload can implement PWorkload, spin up an external producer/consumer, and check progress on the external service in a stable and upgrading cluster.
Workloads run in parallel, so care should be taken not to disrupt other workloads, like changing cluster properties or removing nodes.
Various tests implementing a upgrading+testing pattern can be moved to this test to reduce the number of tests and total ci execution time while doing a more rigorous test.
Fixes #7310
Backports Required
UX Changes
Release Notes