-
Notifications
You must be signed in to change notification settings - Fork 592
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CORE-1198] cluster: cloud storage self test #17586
[CORE-1198] cluster: cloud storage self test #17586
Conversation
6adb3db
to
f867c05
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still need to read through it more detail, but its looking good. it will need to have tests added.
const cloud_storage_clients::bucket_name& bucket, | ||
const cloud_storage_clients::object_key& key) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
general best practice unless performance sensitive is to have coroutines take parameters by value.
src/v/cluster/self_test_rpc_types.h
Outdated
// Amount of fibers to run per shard | ||
uint16_t parallelism{10}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems to be unused, did i miss it? separately, is it is desired for the cloud self-test to perform a benchmark vs only verify that connectivity and basic operations are working? i'm not sure what the right answer is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, currently this is unused- I also had the open question of whether more rigorous benchmarking across multiple shards was desired, or if just verifying connectivity with cloud storage was ideal.
45be6ca
to
2ebaec8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work! Looking pretty solid
0015d8e
to
bb13248
Compare
@@ -2740,17 +2740,20 @@ admin_server::self_test_start_handler(std::unique_ptr<ss::http::request> req) { | |||
r.dtos.push_back(cluster::diskcheck_opts::from_json(obj)); | |||
} else if (test_type == "network") { | |||
r.ntos.push_back(cluster::netcheck_opts::from_json(obj)); | |||
} else if (test_type == "cloud") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'd recommend splitting out the changes to the admin server into a commit the comes right before the rpk changes that use those new http interfaces. makes it much easier to review.
src/v/cluster/self_test_rpc_types.cc
Outdated
case self_test_stage::cloud: | ||
return "cloud"; | ||
default: | ||
__builtin_unreachable(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can probably drop builtin_unreachable if the cases are a covering set for the enum. if they aren't a covering set and you want to fail on default, then use vassert. if they are a covering set but you get a compiler warning, move builtin_unreachable outside the switch statement.
4459db0
to
c399af3
Compare
The cloud storage self test will be ran in the existing Let me know what you think! |
c399af3
to
79983dc
Compare
assert report['error'] == error_msg | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can we also add an assertion that cloud_storage
is among the reports we've received?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added assert
statement to confirm number of received cloud storage reports.
tests/rptest/tests/self_test_test.py
Outdated
assert_fail( | ||
report, | ||
'Remote read is not enabled for this cluster.') | ||
elif report['info'] in ['upload', 'delete']: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: maybe make this an else
and then assert report['info'] in ['upload', 'delete']
? Just so any future cloud_storage report types that get added necessarily have to be considered here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed!
tests/rptest/tests/self_test_test.py
Outdated
remote_read = False | ||
remote_write = False | ||
if hasattr(ctx, 'injected_args') \ | ||
and ctx.injected_args is not None: | ||
if 'cloud_storage_enable_remote_read' in ctx.injected_args: | ||
remote_read = ctx.injected_args[ | ||
'cloud_storage_enable_remote_read'] | ||
if 'cloud_storage_enable_remote_write' in ctx.injected_args: | ||
remote_write = ctx.injected_args[ | ||
'cloud_storage_enable_remote_write'] | ||
|
||
super(SelfTestTest, self).__init__( | ||
test_context=ctx, | ||
si_settings=SISettings( | ||
test_context=ctx, | ||
cloud_storage_enable_remote_read=remote_read, | ||
cloud_storage_enable_remote_write=remote_write)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just FYI, there are other test base classes that allow you to start Redpanda in the test body. EndToEndTest is one example.
Feel free to keep this as is, but in case you want to avoid depending on low level things like injected_args, you could switch SelfTestTest over to using it and calling start_redpanda()
in the test body
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for letting me know about this! It's much nicer than dealing with injected_args
.
Fixed.
006c70b
to
38631cc
Compare
38631cc
to
4d1376e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeh this looks awesome. i think if we squash the fixes back into previous commits this is looking like its pretty much ready.
* Cloud tests: | ||
* Latency test: 1024-bit object. | ||
* Depending on read/write permissions, a series of cloud storage operations are performed. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@r-vasquez IIUC the way this will work is that if the cluster is already set up with TS then these tests will use those configurations.
The cloud storage check performs the following operations: - Upload an object to the configured S3 bucket - List from the bucket - Download from the bucket - Delete from the bucket The cluster self-test contains disk and network tests to validate and benchmark those subsystems. A test for cloud storage helps to ensure credentials and permissions have been correctly configured in redpanda.
Adds the `self_test::cloudcheck` object to `cluster::self_test_backend` so that it can be invoked alongside the existing self-test routines.
To allow for customization of cloud storage self-test options, `admin_server::self_test_start_handler()` will now be able to read options set in the JSON request created on the `rpk` side.
Adds cloud storage self-test bindings to `rpk`, allowing users to invoke the self-test. Also adds the following flags for use with the cloud storage test: - `--cloud-timeout-ms`, the timeout in ms for a cloud storage request - `--cloud-backoff-ms`, the backoff in ms for a cloud storage request - `--only-cloud-test`, in order to run only the cloud storage test
Add 'self_test_stage' as an indicator to user of which self-test routine is currently running. This commit appends the `self_test_stage` to the `self_test_backend` and `self_test_frontend`. Currently, a user can request the status of self-test in 'rpk' using 'cluster self-test status'. However, this only indicates whether a test is running on a node or if the tests are finished, but not which test is being ran. After adding bindings on the `rpk` side, this will enable users to see which self-test is currently running.
`admin_server` will now set the self test stage in its status reports presented to user.
`rpk` will now generate a report of running nodes with node ID, as well as the current stage of the self test on those running nodes. This results in more detailed status updates presented to the user when `cluster self-test status` is ran in `rpk`. User will now see which test is currently being run on which node, if a test is running.
Adds cloud storage self test result parsing to `SelfTestTest`. Also adds configuration for the cloud storage self-test to `clients/rpk.py`.
4d1376e
to
3ab7798
Compare
@@ -48,6 +51,15 @@ of the cluster. Available tests to run: | |||
* Unique pairs of Redpanda nodes each act as a client and a server. | |||
* The test pushes as much data over the wire, within the test parameters. | |||
|
|||
* Cloud tests: | |||
* Latency test: 1024-bit object. | |||
* Depending on cluster read/write permissions (cloud_storage_enable_remote_read, cloud_storage_enable_remote_write), a series of cloud storage operations are performed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Depending on cluster read/write permissions (cloud_storage_enable_remote_read, cloud_storage_enable_remote_write), a series of cloud storage operations are performed. | |
* Depending on cluster read/write permissions (cloud_storage_enable_remote_read, cloud_storage_enable_remote_write), a series of cloud storage operations are performed: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left minor punctuation update.
Extended the self test ducktape test coverage for all combinations of `cloud_storage_enable_remote_read` and `cloud_storage_enable_remote_write` permissions.
3ab7798
to
62dbcd7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
everything looks good. just wondering about the necessity of using the e2e test fixture in ducktape.
Adds a cloud storage check to the cluster self test, as requested in #9225.
Depending on read/write permissions, the cloud storage test uses the
cloud_storage::remote
object configured at the application layer to:Backports Required
Release Notes
Features
To run, use
rpk cluster self-test start
. The following flags have also been added torpk
for use with the cloud storage test:--cloud-backoff-ms uint
, the backoff in milliseconds for a cloud storage request.--cloud-timeout-ms uint
, the timeout in milliseconds for a cloud storage request.--only-cloud-test
, in order to run only the cloud storage test.